Abstract

This paper presents several novel theoretical results regarding the recovery of a low-rank 
matrix from just a few measurements consisting of linear combinations of the matrix entries. 
We show that properly constrained nuclear-norm minimization stably recovers a low-rank matrix 
from a constant number of noisy measurements per degree of freedom; this seems to be the first 
result of this nature. Further, the recovery error from noisy data is within a constant of three 
targets: 1) the minimax risk, 2) an 'oracle' error that would be available if the column space of 
the matrix were known, and 3) a more adaptive 'oracle' error which would be available with the 
knowledge of the column space corresponding to the part of the matrix that stands above the
noise. Lastly, the error bounds regarding low-rank matrices are extended to provide an error
bound when the matrix has full rank with decaying singular values. The analysis in this paper
is based on the restricted isometry property (RIP) introduced in [5] for vectors, and in [22] for
matrices.

1 Introduction

Low-rank matrix recovery is a burgeoning topic drawing the attention of many researchers in the 
closely related field of sparse approximation and compressive sensing. To draw an analogy, in the 
sparse approximation setup, the signal y is modeled as a sparse linear combination of elements from 
a dictionary D so that 

y = Dx, 

where x is a sparse coefficient vector. The goal is to recover x. In the matrix recovery problem, the
signal to be recovered is a low-rank matrix $M \in \mathbb{R}^{n_1 \times n_2}$, about which we have information supplied
by means of a linear operator $\mathcal{A} : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^m$ (typically, m is far less than $n_1 n_2$),

$$y = \mathcal{A}(M), \qquad y \in \mathbb{R}^m.$$

In both cases, signal recovery appears to be an ill-posed problem because there are many more
unknowns than equations. However, as has been shown extensively in the sparse-approximation 
literature, the assumption that the object of interest is sparse makes this problem meaningful even 
when the linear system of equations is apparently underdetermined. Further, when the measurements
are corrupted by noise, we now know that by taking into account the parsimony of the
model, one can ensure that the recovery error is within a log factor of the error one would achieve
by regressing y onto the low-dimensional subspace spanned by those columns of D with $x_i \neq 0$; the
squared error is adaptive, and proportional to the true dimension of the signal [3], [7].

In this paper, we derive similar results for matrix recovery. In contrast to results available in 
the literature on compressive sensing or sparse regression, we show that the error bound is within a 
constant factor (rather than a log factor) of an idealized 'oracle' error bound achieved by projecting 
the data onto a smaller subspace given by the 'oracle' (and also within a constant of the minimax 
error bound). This error bound also applies to full-rank matrices (which are well approximated by
low-rank matrices), and there appears to be no analogue of this in the compressive sensing world. 

Another contribution of this paper is to lower the number of measurements needed to stably recover
a matrix of rank r by convex programming. It is not hard to see that we need at least
$m \geq (n_1 + n_2 - r)r$ measurements to recover matrices of rank r, by any method whatsoever. To be sure,
if $m < (n_1 + n_2 - r)r$, we will always have two distinct matrices M and M' of rank at most r
with the property $\mathcal{A}(M) = \mathcal{A}(M')$ no matter what $\mathcal{A}$ is. To see this, fix two matrices $U \in \mathbb{R}^{n_1 \times r}$,
$V \in \mathbb{R}^{n_2 \times r}$ with orthonormal columns, and consider the linear space of matrices of the form

$$T = \{UX^* - YV^* : X \in \mathbb{R}^{n_2 \times r},\ Y \in \mathbb{R}^{n_1 \times r}\}.$$

The dimension of T is $r(n_1 + n_2 - r)$. Thus, if $m < (n_1 + n_2 - r)r$, there exists $M = UX^* - YV^* \neq 0$ in
T such that $\mathcal{A}(M) = 0$. This proves the claim since $\mathcal{A}(UX^*) = \mathcal{A}(YV^*)$ for two distinct matrices of
rank at most r. Now a novel result of this paper is that, even without knowing that $M \in T$, one can
stably recover M from a constant times $(n_1 + n_2)r$ measurements via nuclear-norm minimization.
Once again, in contrast to similar results in compressive sensing, the number of measurements
required is within a constant of the theoretical lower limit: there is no extra log factor.
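
The dimension count for the space T can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `tangent_space_dimension` and the sizes used are illustrative choices, not part of the paper. It spans T by the matrices $UX^*$ and $YV^*$ with X and Y ranging over standard basis matrices, and computes the rank of that spanning family.

```python
import numpy as np

def tangent_space_dimension(n1, n2, r, seed=0):
    """Numerically compute dim{U X^* - Y V^*} for random orthonormal U, V."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n1, r)))  # n1 x r, orthonormal columns
    V, _ = np.linalg.qr(rng.standard_normal((n2, r)))  # n2 x r, orthonormal columns
    generators = []
    # Matrices U X^* with X a standard basis matrix of size n2 x r.
    for j in range(n2):
        for k in range(r):
            X = np.zeros((n2, r))
            X[j, k] = 1.0
            generators.append((U @ X.T).ravel())
    # Matrices Y V^* with Y a standard basis matrix of size n1 x r.
    for j in range(n1):
        for k in range(r):
            Y = np.zeros((n1, r))
            Y[j, k] = 1.0
            generators.append((Y @ V.T).ravel())
    # The rank of the spanning family is the dimension of T.
    return int(np.linalg.matrix_rank(np.array(generators)))

dim_T = tangent_space_dimension(6, 5, 2)
```

The two families overlap in the $r^2$-dimensional space of matrices of the form $UBV^*$, which is why the count is $r(n_1 + n_2) - r^2$ rather than $r(n_1 + n_2)$.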

1.1 A few applications 

Following a series of advances in the theory of low-rank matrix recovery from undersampled linear 
measurements [5ll8V H0^[16Hl81[20l l22| . a number of new applications have sprung up to join ranks 
with the already established ones. A quick survey shows that low-rank modeling is getting very 
popular in science and engineering, and we present a few eclectic examples to illustrate this point. 

• Quantum state tomography [15]. In quantum state tomography, a mixed quantum state
is represented as a square positive semidefinite matrix, M (with trace 1). If M is actually a
pure state, then it has rank 1, and more generally, if it is approximately pure then it will be
well approximated by a low-rank matrix [15].

• Face recognition [2], [5]. Here the sequence of signals $\{y_i\}$ are images of the same face under
varying illumination. In theory and under idealized circumstances (the faces are assumed to
be convex, Lambertian objects), these images all reside near the same nine-dimensional linear
subspace [2]. In practice, face-recognition techniques based on the assumption that these
images reside in a low-dimensional subspace are highly successful [2], [5].

• Distance measurements. Let $x_i \in \mathbb{R}^d$ be a sequence of vectors representing several
positions in d dimensional space, and let M be the matrix of squared distances between
these vectors, $M_{ij} = \|x_i - x_j\|_{\ell_2}^2$. Then M has rank bounded by d + 2. To see this, let
$X = [x_1, x_2, \ldots, x_n]$ be a concatenation of the position vectors. Then, letting $\{e_i\}$ be the
standard basis vectors,

$$M_{ij} = (e_i - e_j)^* X^* X (e_i - e_j) = -2 e_i^* X^* X e_j + e_i^* \mathbf{1} q^* e_j + e_i^* q \mathbf{1}^* e_j,$$

where $\mathbf{1}$ is a vector containing all ones, and q is a vector with $q_i = \|x_i\|_{\ell_2}^2$. Thus
$M = -2X^*X + \mathbf{1}q^* + q\mathbf{1}^*$. The first matrix has rank bounded by d and the second two have rank
bounded by 1. In fact, one can project out $\mathbf{1}q^* + q\mathbf{1}^*$ in order to reduce the rank to d, which
in usual applications will be 2 for positions constrained to lie in the plane, and 3 for positions
constrained to lie somewhere in space.
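
The rank bound for the squared-distance matrix is easy to verify on random data. The sketch below assumes NumPy; the helper name `squared_distance_matrix` and the sizes d = 3, n = 20 are illustrative. It builds M from the Gram matrix $X^*X$ and the vector q of squared norms, compares it with a brute-force computation, and reports its numerical rank.

```python
import numpy as np

def squared_distance_matrix(X):
    """Return M with M[i, j] = ||x_i - x_j||^2 for the columns x_i of X (d x n)."""
    gram = X.T @ X                 # X^* X
    q = np.diag(gram)              # q_i = ||x_i||^2
    ones = np.ones_like(q)
    return -2.0 * gram + np.outer(ones, q) + np.outer(q, ones)

rng = np.random.default_rng(0)
d, n = 3, 20
X = rng.standard_normal((d, n))
M = squared_distance_matrix(X)

# Brute-force pairwise squared distances for comparison.
brute = np.array([[np.sum((X[:, i] - X[:, j]) ** 2) for j in range(n)]
                  for i in range(n)])
rank_M = np.linalg.matrix_rank(M)
```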

Quantum state tomography lends itself perfectly to the compressive sensing framework. On an 
abstract level, one sees measurements consisting of linear combinations of the unknown quantum 
state M - inner products with certain observables which can be chosen with some flexibility by the 
physicist - and the goal is to recover a good approximation of M. The size of M grows exponentially 
with the number of particles in the system, so one would like to use the structure of M to reduce 
the number of measurements required, thus necessitating compressive sensing (see [15] for a more 
in-depth discussion and a specific analysis of this problem). Another more established example is
sensor localization, in which one sees a subset of the entries of a distance matrix because each sensor
has low power and can reliably sense only its distance to nearby sensors. The goal is to fill in
the missing entries (matrix completion). In some applications of the face recognition example, one
would see the entire set of faces (the sampling operator is the identity), and the low-rank structure
can be used to remove sparse, but otherwise arbitrarily gross, errors from the data as described
in [5] (we include this example to illustrate the different uses of the low-rank matrix model, but
also note that it is quite different from the problem addressed in our paper).

1.2 Prior literature 

There has recently been an explosion of literature regarding low-rank matrix recovery, with special 
attention given to the matrix completion subproblem (as made famous by the million dollar Netflix 
Prize). Several different algorithms have been proposed, with many drawing their roots from 
standard compressive sensing techniques [3II^I10 | I12 | I16 | [TT ]I20H22] . For example, nuclear-norm 
minimization is highly analogous to $\ell_1$ minimization (as a convex relaxation to an intractable
problem), and the algorithms analyzed in this paper are analogous to the Dantzig Selector and the 
LASSO. 

The theory regarding the power of nuclear-norm minimization in recovering low-rank matrices
from undersampled measurements began with a paper by Recht et al. [22], which sought to bridge
compressive sensing with low-rank matrix recovery via the RIP (to be defined in Section 2.1).
Subsequently, several papers specialized the theory of nuclear-norm minimization to the matrix
completion problem [5], [8], [10], [14], which turns out to be 'RIPless'; this literature is motivated by
very clear applications such as recommender systems and network localization, and has required
very sophisticated mathematical techniques.

1 An interesting point about quantum state tomography is that if one enforces the constraints trace(M) = 1 and
$M \succeq 0$, then this ensures that $\|M\|_* = 1$, and the scientist is left with a feasibility problem. In [15] the authors
suggest solving this feasibility problem by removing a constraint and then performing nuclear-norm minimization,
and they show that under certain conditions this is sufficient for exact recovery (and thus of course the solution obeys
the unenforced constraint).

With the recent increase in attention given to the low-rank matrix model, which the authors
surmise is due to the spring of new theory, new applications are being quickly discovered that
deviate from the matrix completion setup (such as quantum state tomography) and could
benefit from a different analysis. Our paper returns to measurement ensembles obeying the RIP as
in [22], which are of a different nature than those involved in matrix completion. As in compressive
sensing, the only known measurement ensembles which provably satisfy the RIP at a nearly minimal
sampling rate are random (such as the Gaussian measurement ensemble in Section 2.1). Having said
this, two comments are in order. First, our results provide an absolute benchmark of what is 
achievable, thus allowing direct comparisons with other methods and other sampling operators A. 
For instance, one can quantify how far the error bounds for the RIPless matrix completion are 
from what is then known to be essentially unimprovable. Second, since our results imply that 
the restricted isometry property alone guarantees a near-optimal accuracy, we hope that this will 
encourage more applications with random ensembles, and also encourage researchers to establish 
whether or not their measurements obey this desirable property. Finally, we hope that our analysis 
offers insights for applications with nonrandom measurement ensembles. 



1.3 Problem setup 

We observe data y from the model

$$y = \mathcal{A}(M) + z, \tag{1.1}$$

where M is an unknown $n_1 \times n_2$ matrix, $\mathcal{A} : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^m$ is a linear mapping, and z is an m-dimensional
noise term. The synthesized versions of our error bounds assume that z is a Gaussian
vector with i.i.d. $\mathcal{N}(0, \sigma^2)$ entries. The goal is to recover a good approximation of M while requiring
as few measurements as possible.

We pause to demonstrate the form of $\mathcal{A}(X)$ explicitly: the ith entry of $\mathcal{A}(X)$ is $[\mathcal{A}(X)]_i = \langle A_i, X \rangle$
for some sequence of matrices $\{A_i\}$ and with the standard inner product $\langle A, X \rangle = \operatorname{trace}(A^* X)$.
Each $A_i$ can be likened to a row of a compressive sensing matrix, and in fact it can aid the intuition
to think of $\mathcal{A}$ as a large matrix, i.e. one could write $\mathcal{A}(X)$ as

$$\mathcal{A}(X) = \begin{bmatrix} \operatorname{vec}(A_1) \\ \operatorname{vec}(A_2) \\ \vdots \\ \operatorname{vec}(A_m) \end{bmatrix} \operatorname{vec}(X), \tag{1.2}$$

where vec(X) is a long vector obtained by stacking the columns of X. In the common matrix
completion problem, each $A_i$ is of the form $e_k e_j^*$ so that the ith component of $\mathcal{A}(M)$ is of the form

$$\langle e_k e_j^*, M \rangle = e_k^* M e_j = M_{kj}$$

for some (j, k).
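
To make the "large matrix" picture concrete, here is a small sketch assuming NumPy. The helper names `apply_operator` and `operator_as_matrix` are hypothetical, and the vectorization is taken to be column stacking (Fortran order), with each vec of a measurement matrix laid out as a row. It checks that the inner-product form and the matrix form agree, and that a matrix-completion measurement picks out a single entry.

```python
import numpy as np

def apply_operator(A_mats, X):
    """Return the vector with entries <A_i, X> = trace(A_i^* X)."""
    return np.array([np.trace(A_i.T @ X) for A_i in A_mats])

def operator_as_matrix(A_mats):
    """Stack vec(A_i) as rows; vec stacks columns, i.e. Fortran order."""
    return np.stack([A_i.ravel(order="F") for A_i in A_mats])

rng = np.random.default_rng(0)
n1, n2, m = 4, 3, 5
A_mats = [rng.standard_normal((n1, n2)) for _ in range(m)]
X = rng.standard_normal((n1, n2))

big_A = operator_as_matrix(A_mats)            # shape (m, n1 * n2)
same = np.allclose(apply_operator(A_mats, X), big_A @ X.ravel(order="F"))

# Matrix completion: A_i = e_k e_j^* picks out the entry X[k, j].
k, j = 2, 1
E = np.zeros((n1, n2))
E[k, j] = 1.0
picked = apply_operator([E], X)[0]
```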
1.4 Algorithms

To recover M, we propose solving one of two nuclear-norm-minimization based algorithms. The
first is an analogue to the Dantzig Selector from compressive sensing [7], defined as follows:

$$\begin{array}{ll} \text{minimize} & \|X\|_* \\ \text{subject to} & \|\mathcal{A}^*(r)\| \leq \lambda \\ & r = y - \mathcal{A}(X), \end{array} \tag{1.3}$$

where the optimal solution is our estimate $\hat{M}$, $\|\cdot\|$ is the operator norm and $\|\cdot\|_*$ is its dual, i.e. the
nuclear norm, and $\mathcal{A}^*$ is the adjoint of $\mathcal{A}$. We call this convex program the matrix Dantzig selector.

To pick a useful value for the parameter $\lambda$ in (1.3), we stipulate that the 'true' matrix M
should be feasible (this is a necessary condition for our proofs). In other words, one should have
$\|\mathcal{A}^*(z)\| \leq \lambda$; Section 2.2 provides further intuition about this requirement. In the case of Gaussian
noise, this corresponds to $\lambda = C\sqrt{n}\,\sigma$ for some numerical constant C as in the following lemma.

Lemma 1.1 Suppose z is a Gaussian vector with i.i.d. $\mathcal{N}(0, \sigma^2)$ entries and let $n = \max(n_1, n_2)$.
Then if $C > 4\sqrt{(1 + \delta_1)\log 12}$,

$$\|\mathcal{A}^*(z)\| \leq C\sqrt{n}\,\sigma, \tag{1.4}$$

with probability at least $1 - 2e^{-cn}$ for a fixed numerical constant c > 0.

This lemma is proved in Section 3 using a standard covering argument. The scalar $\delta_1$ is the isometry
constant at rank 1, as defined in Section 2.1; suffice it to say for now that it is a very small constant
bounded by $\sqrt{2} - 1$ (with high probability) under the assumptions of all of our theorems.
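
A quick simulation gives a feel for the size of the operator norm of the adjoint applied to the noise. The sketch below assumes NumPy and a Gaussian measurement ensemble with i.i.d. N(0, 1/m) entries; the function name `adjoint_noise_norm` and all sizes are illustrative assumptions. It reports the ratio of that operator norm to $\sqrt{n}\,\sigma$ over a few draws, which can be compared with the constant 8 used for the Dantzig selector parameter in Theorem 2.4. This is an empirical illustration only, not a substitute for the covering argument in the proof.

```python
import numpy as np

def adjoint_noise_norm(n1, n2, m, sigma, rng):
    """Draw a Gaussian ensemble and noise; return the operator norm of A^*(z)."""
    A_mats = rng.standard_normal((m, n1, n2)) / np.sqrt(m)  # entries N(0, 1/m)
    z = sigma * rng.standard_normal(m)                       # entries N(0, sigma^2)
    adj = np.tensordot(z, A_mats, axes=1)                    # sum_i z_i A_i
    return np.linalg.norm(adj, 2)                            # top singular value

rng = np.random.default_rng(0)
n1, n2, m, sigma = 30, 40, 600, 0.5
n = max(n1, n2)
norms = [adjoint_noise_norm(n1, n2, m, sigma, rng) for _ in range(20)]
ratios = np.array(norms) / (np.sqrt(n) * sigma)
print("largest observed ratio:", ratios.max())
```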

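The optimization program (1.3) may be formulated as a semidefinite program (SDP) and can
thus be solved by any of the standard SDP solvers. To see this, we first recall that the nuclear
norm admits an SDP characterization since $\|X\|_*$ is the optimal value of the SDP

$$\begin{array}{ll} \text{minimize} & (\operatorname{trace}(W_1) + \operatorname{trace}(W_2))/2 \\ \text{subject to} & \begin{bmatrix} W_1 & X \\ X^* & W_2 \end{bmatrix} \succeq 0 \end{array}$$

with optimization variables $W_1, W_2 \in \mathbb{R}^{n \times n}$. Second, the constraint $\|\mathcal{A}^*(r)\| \leq \lambda$ is an SDP
constraint since it can be expressed as the linear matrix inequality (LMI)

$$\begin{bmatrix} \lambda I_n & \mathcal{A}^*(r) \\ [\mathcal{A}^*(r)]^* & \lambda I_n \end{bmatrix} \succeq 0.$$

This shows that (1.3) can be formulated as the SDP

$$\begin{array}{ll} \text{minimize} & (\operatorname{trace}(W_1) + \operatorname{trace}(W_2))/2 \\ \text{subject to} & \begin{bmatrix} W_1 & X & 0 & 0 \\ X^* & W_2 & 0 & 0 \\ 0 & 0 & \lambda I_n & \mathcal{A}^*(r) \\ 0 & 0 & [\mathcal{A}^*(r)]^* & \lambda I_n \end{bmatrix} \succeq 0 \\ & r = y - \mathcal{A}(X), \end{array}$$

with optimization variables $X, W_1, W_2 \in \mathbb{R}^{n \times n}$.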
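
Both ingredients of the SDP reformulation can be sanity-checked numerically. The sketch below assumes NumPy and uses illustrative sizes. It verifies (i) that the choice $W_1 = U\Sigma U^*$, $W_2 = V\Sigma V^*$ built from an SVD $X = U\Sigma V^*$ is feasible and attains the value $\|X\|_*$ (this shows attainment only, not optimality), and (ii) that the smallest eigenvalue of the LMI block matrix equals $\lambda$ minus the operator norm, so the LMI holds exactly when the operator norm is at most $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
X = rng.standard_normal((n, n))
U, s, Vt = np.linalg.svd(X)

# Feasible point W1 = U S U^*, W2 = V S V^* with objective value ||X||_*.
W1 = U @ np.diag(s) @ U.T
W2 = Vt.T @ np.diag(s) @ Vt
block = np.block([[W1, X], [X.T, W2]])
min_eig_block = np.linalg.eigvalsh(block).min()      # should be >= 0 up to rounding
objective = (np.trace(W1) + np.trace(W2)) / 2

# LMI: the eigenvalues of [[lam I, B], [B^*, lam I]] are lam +/- sigma_i(B).
lam = 1.5 * s[0]
lmi = np.block([[lam * np.eye(n), X], [X.T, lam * np.eye(n)]])
min_eig_lmi = np.linalg.eigvalsh(lmi).min()
```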

However, a few algorithms that avoid interior-point methods have recently been developed to solve similar
nuclear-norm minimization problems, and they work extremely efficiently in practice [HET].
The nuclear-norm minimization problem solved using fixed-point continuation in [21]
is an analogue to the LASSO, and is defined as follows:

$$\text{minimize} \quad \tfrac{1}{2}\|\mathcal{A}(X) - y\|_{\ell_2}^2 + \mu\|X\|_*. \tag{1.5}$$

We call this convex program the matrix Lasso and it is the second convex program whose theoretical 
properties are analyzed in this paper. 
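
For readers who want to experiment with the matrix Lasso, the following is a minimal proximal-gradient sketch, assuming NumPy. It is a generic textbook scheme (a gradient step followed by singular value thresholding), not the specific fixed-point continuation implementation of [21]; the function names, step-size rule, and problem sizes are illustrative assumptions. The operator is represented as an m x (n1 n2) matrix acting on the column-stacked vec(X), and the objective is taken to be half the squared residual norm plus mu times the nuclear norm.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: shrink each singular value by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def matrix_lasso_prox_grad(A, y, shape, mu, n_iter=200):
    """Proximal gradient for 0.5 * ||A vec(X) - y||^2 + mu * ||X||_*."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2       # 1 / Lipschitz constant of the gradient
    X = np.zeros(shape)
    history = []                                 # objective value at each iterate
    for _ in range(n_iter):
        resid = A @ X.ravel(order="F") - y
        history.append(0.5 * resid @ resid + mu * np.linalg.norm(X, "nuc"))
        grad = (A.T @ resid).reshape(shape, order="F")
        X = svt(X - step * grad, step * mu)
    return X, history

rng = np.random.default_rng(0)
n1, n2, r, m = 8, 8, 2, 50
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
A = rng.standard_normal((m, n1 * n2)) / np.sqrt(m)
y = A @ M.ravel(order="F") + 0.01 * rng.standard_normal(m)
M_hat, history = matrix_lasso_prox_grad(A, y, (n1, n2), mu=0.05)
```

With this step size the objective is non-increasing from one iterate to the next, which is a convenient check that the update is wired correctly.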

1.5 Organization of the paper 

The results in this paper mostly concern random measurements and random noise and so they
hold with high probability. In Section 2.1, we show that certain classes of random measurements
satisfy the RIP when only sampling a constant number of measurements per degree of freedom. In
Section 2.2, we present the simplest of our error bounds, demonstrating that when the RIP holds,
the solution to (1.3) is within a constant of the minimax risk. This error bound is refined in Section
2.3 to provide a more adaptive error bound that improves when the singular values of M decay
below the noise level. It is shown that this error bound is within a constant of the expected value
of a certain 'oracle' error bound. In Section 2.4, we present an error bound handling the case when
M has full rank but is well approximated by a low-rank matrix. Section 3 contains the proofs and
we finish with some concluding remarks in Section 4.

1.6 Notation 

We review all notation used in this paper in order to ease readability. We assume $M \in \mathbb{R}^{n_1 \times n_2}$ and
let $n = \max(n_1, n_2)$. A variety of norms are used throughout this paper: $\|X\|_*$ is the nuclear norm
(the sum of the singular values); $\|X\|$ is the operator norm of X (the top singular value); $\|X\|_F$ is
the Frobenius norm (the $\ell_2$-norm of the vector of singular values). The matrix $X^*$ is the adjoint
of X, and for the linear operator $\mathcal{A} : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^m$, $\mathcal{A}^* : \mathbb{R}^m \to \mathbb{R}^{n_1 \times n_2}$ is the adjoint operator.
Specifically, if $[\mathcal{A}(X)]_i = \langle A_i, X \rangle$ for all matrices $X \in \mathbb{R}^{n_1 \times n_2}$, then

$$\mathcal{A}^*(q) = \sum_{i=1}^m q_i A_i$$

for all vectors $q \in \mathbb{R}^m$.
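
The notation above maps directly onto a few lines of NumPy. This sketch (function names `forward` and `adjoint` are illustrative) stores the measurement matrices in an array of shape (m, n1, n2), checks the defining adjoint identity $\langle \mathcal{A}(X), q \rangle = \langle X, \mathcal{A}^*(q) \rangle$, and computes the three norms from the singular values.

```python
import numpy as np

def forward(A_mats, X):
    """A(X): the vector of inner products <A_i, X>."""
    return np.einsum("ijk,jk->i", A_mats, X)

def adjoint(A_mats, q):
    """A^*(q) = sum_i q_i A_i."""
    return np.einsum("i,ijk->jk", q, A_mats)

rng = np.random.default_rng(1)
m, n1, n2 = 7, 4, 5
A_mats = rng.standard_normal((m, n1, n2))
X = rng.standard_normal((n1, n2))
q = rng.standard_normal(m)

lhs = forward(A_mats, X) @ q                 # <A(X), q>
rhs = np.sum(X * adjoint(A_mats, q))         # <X, A^*(q)>

s = np.linalg.svd(X, compute_uv=False)
nuclear, operator, frobenius = s.sum(), s.max(), np.sqrt(np.sum(s ** 2))
```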

2 Main Results 
2.1 Matrix RIP 

The matrix version of the RIP is an integral tool in proving our theoretical results and we begin 
by defining the RIP in this setting and describing measurement ensembles that satisfy it. To 
characterize the RIP, we introduce the isometry constants of a linear map $\mathcal{A}$.

Definition 2.1 For each integer r = 1, 2, ..., n, the isometry constant $\delta_r$ of $\mathcal{A}$ is the smallest
quantity such that

$$(1 - \delta_r)\|X\|_F^2 \leq \|\mathcal{A}(X)\|_{\ell_2}^2 \leq (1 + \delta_r)\|X\|_F^2 \tag{2.1}$$

holds for all matrices of rank at most r.

We say that $\mathcal{A}$ satisfies the RIP at rank r if $\delta_r$ is bounded by a sufficiently small constant between
0 and 1, the value of which will become apparent in further sections (see e.g. Theorem 2.4).

Which linear maps A satisfy the RIP? As a quintessential example, we introduce the Gaussian 
measurement ensemble. 

Definition 2.2 $\mathcal{A}$ is a Gaussian measurement ensemble if each 'row' $A_i$, $1 \leq i \leq m$, contains
i.i.d. $\mathcal{N}(0, 1/m)$ entries (and the $A_i$'s are independent from each other).

This is of course highly analogous to the Gaussian random matrices in compressive sensing. Our
first result is that Gaussian measurement ensembles, along with many other random measurement
ensembles, satisfy the RIP when $m \geq Cnr$ (with high probability) for some constant C > 0.

Theorem 2.3 Fix $0 \leq \delta < 1$ and let $\mathcal{A}$ be a random measurement ensemble obeying the following
condition: for any given $X \in \mathbb{R}^{n_1 \times n_2}$ and any fixed 0 < t < 1,

$$\mathbb{P}\left(\left|\|\mathcal{A}(X)\|_{\ell_2}^2 - \|X\|_F^2\right| > t\|X\|_F^2\right) \leq C\exp(-cm) \tag{2.2}$$

for fixed constants C, c > 0 (which may depend on t). Then if $m \geq Dnr$, $\mathcal{A}$ satisfies the RIP with
isometry constant $\delta_r \leq \delta$ with probability exceeding $1 - Ce^{-dm}$ for fixed constants D, d > 0.

The many unspecified constants involved in the presentation of Theorem 2.3 are meant to allow
for general use with many random measurement ensembles. However, to make the presentation
more concrete we describe the constants involved in the concentration bound (2.2) for a few special
random measurement ensembles. If $\mathcal{A}$ is a Gaussian random measurement ensemble, $\|\mathcal{A}(X)\|_{\ell_2}^2$ is
distributed as $m^{-1}\|X\|_F^2$ times a chi-squared random variable with m degrees of freedom and (2.2)
follows from standard concentration inequalities [1], [22]. Specifically, we have

$$\mathbb{P}\left(\left|\|\mathcal{A}(X)\|_{\ell_2}^2 - \|X\|_F^2\right| > t\|X\|_F^2\right) \leq 2\exp\left(-\frac{m}{2}\left(t^2/2 - t^3/3\right)\right). \tag{2.3}$$

Similarly, $\mathcal{A}$ satisfies equation (2.3) in the case when each 'row' $A_i$ has i.i.d. entries
that are equally likely to take the value $1/\sqrt{m}$ or $-1/\sqrt{m}$, or if $\mathcal{A}$ is a random projection [1], [22].
Further, $\mathcal{A}$ satisfies (2.2) if the 'rows' $A_i$ contain sub-Gaussian entries (properly normalized) [23],
although in this case the constants involved depend on the parameters of the sub-Gaussian entries.
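
The Gaussian concentration bound can be probed with a short Monte Carlo experiment. The sketch below assumes NumPy; the function name `deviation_frequency`, the fixed rank-2 test matrix, and the values m = 200, t = 0.3 are illustrative choices. It draws fresh Gaussian ensembles, records how often the squared measurement norm deviates from the squared Frobenius norm by more than a fraction t, and compares that frequency with the right-hand side of (2.3).

```python
import numpy as np

def deviation_frequency(n1, n2, r, m, t, trials, rng):
    """Empirical frequency of | ||A(X)||^2 - ||X||_F^2 | > t ||X||_F^2 for one fixed X."""
    X = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
    x = X.ravel(order="F")
    norm_sq = x @ x
    hits = 0
    for _ in range(trials):
        A = rng.standard_normal((m, n1 * n2)) / np.sqrt(m)   # entries N(0, 1/m)
        hits += abs(np.sum((A @ x) ** 2) - norm_sq) > t * norm_sq
    return hits / trials

rng = np.random.default_rng(0)
m, t = 200, 0.3
freq = deviation_frequency(8, 6, 2, m, t, trials=500, rng=rng)
bound = 2 * np.exp(-m / 2 * (t**2 / 2 - t**3 / 3))
print("empirical frequency:", freq, "bound (2.3):", bound)
```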

In order to ascertain the strength of Theorem 2.3, note that the number of degrees of freedom of
an $n_1 \times n_2$ matrix of rank r is equal to $r(n_1 + n_2 - r)$.² Thus, one may expect that if $m < r(n_1 + n_2 - r)$,
there should be a rank-r matrix in the null space of $\mathcal{A}$ leading to a failure to achieve the lower
bound in (2.1). In order to make this intuition rigorous (to within a constant) assume without loss
of generality that $n_2 \geq n_1$, and observe that the set of rank-r matrices contains all those matrices
restricted to have nonzero entries only in the first r rows. This is an $n \times r$ dimensional vector space
and thus we must have $m \geq nr$ or otherwise there will be a rank-r matrix in the null space of $\mathcal{A}$
regardless of what measurements are used. (This is a similar alternative to the null-space argument
posed in the introduction.)

2 This can be seen by counting the number of equations and unknowns in the singular value decomposition.

Theorem 2.3 is inspired by a similar theorem in [22, Theorem 4.2] and refines this result in two
ways. First, it shows that one only needs a constant number of measurements per degree of freedom
of the underlying rank-r matrix in order to obtain the RIP at rank r (which improves on the result
in [22] by a factor of log n and also achieves the theoretical lower bound to within a constant).
Second, it shows that one need only require a single concentration bound on $\mathcal{A}$, removing another
assumption required in [22]. A possible third benefit is that the proof follows simply and quickly
from a specialized covering argument. The novelty is in the method used to cover low-rank matrices.

2.2 The matrix Dantzig selector and the matrix Lasso are nearly minimax 

In this section, we present our first and simplest error bound, which only requires that A satisfies 
the RIP. 

Theorem 2.4 Assume that $\operatorname{rank}(M) \leq r$ and let $\hat{M}_{DS}$ be the solution to the matrix Dantzig
selector (1.3) and $\hat{M}_L$ be the solution to the matrix Lasso (1.5). If $\delta_{4r} < \sqrt{2} - 1$ and $\|\mathcal{A}^*(z)\| \leq \lambda$,
then

$$\|\hat{M}_{DS} - M\|_F^2 \leq C_0 r \lambda^2, \tag{2.4}$$

and if $\delta_{4r} < (3\sqrt{2} - 1)/17$ and $\|\mathcal{A}^*(z)\| \leq \mu/2$, then

$$\|\hat{M}_L - M\|_F^2 \leq C_1 r \mu^2; \tag{2.5}$$

above, $C_0$ and $C_1$ are small constants depending only on the isometry constant $\delta_{4r}$. In particular,
if z is a Gaussian error and $\hat{M}$ is either $\hat{M}_{DS}$ with $\lambda = 8\sqrt{n}\,\sigma$, or $\hat{M}_L$ with $\mu = 16\sqrt{n}\,\sigma$, we have

$$\|\hat{M} - M\|_F^2 \leq C_0' n r \sigma^2 \tag{2.6}$$

with probability at least $1 - 2e^{-cn}$ for a constant $C_0'$ (depending only on $\delta_{4r}$).

Note that (2.6) follows from (2.4) and (2.5) simply by plugging $\lambda = \mu/2 = 8\sqrt{n}\,\sigma$ into Lemma 1.1.
In a nutshell, the error is proportional to the number of degrees of freedom times the noise level.

An important point is that one may expect the error to be reduced when further measurements
are taken, i.e. one may expect the error to be inversely proportional to m. In fact, this is the case
for the Gaussian measurement ensemble, but this extra factor is absorbed into the definition in
order to normalize the measurements so that they satisfy the RIP. If instead, each 'row' $A_i$ in the
Gaussian measurement ensemble is defined to have i.i.d. standard normal entries, then by a simple
rescaling argument (apply Theorem 2.4 to $y/\sqrt{m}$), the error bound reads

$$\|\hat{M} - M\|_F^2 \leq C_0' n r \sigma^2 / m.$$

A second remark is that exploiting the low-rank structure helps to denoise. For example, if we
measured every entry of M (a measurement ensemble with isometry constant $\delta_r = 0$), but with
each measurement corrupted by a $\mathcal{N}(0, \sigma^2)$ noise term, then taking the measurements as they are
as the estimate of M would lead to an expected error equal to

$$\mathbb{E}\|\hat{M} - M\|_F^2 = n^2\sigma^2.$$

Nuclear-norm minimization³ reduces this error by a factor of about n/r.

The strength of Theorem 2.4 is that the error bound (2.6) is nearly optimal in the sense that
no estimator can do essentially better without further assumptions, as seen by lower-bounding the
expected minimax error.

Theorem 2.5 If z is a Gaussian error, then any estimator $\hat{M}(y)$ obeys

$$\sup_{M : \operatorname{rank}(M) \leq r} \mathbb{E}\|\hat{M}(y) - M\|_F^2 \geq \frac{1}{1 + \delta_r}\, n r \sigma^2. \tag{2.7}$$

In other words, the minimax error over the class of matrices of rank at most r is lower bounded by
about $nr\sigma^2$.

Before continuing, it may be helpful to analyze the solutions to the matrix Dantzig selector
and the matrix Lasso in a simple case in order to understand the error bounds in Theorem 2.4
intuitively, and also to understand our choice of $\lambda$ and $\mu$. Suppose $\mathcal{A}$ is the identity so that, changing
the notation a bit, the model is Y = M + Z, where Z is an n x n matrix with i.i.d. Gaussian entries.
We would like the unknown matrix M to be a feasible point, which requires that $\|Z\| \leq \lambda$ (for
example, if $\|Z\| > \lambda$, we already have problems when M = 0). It is well known that the top singular
value of a square n x n Gaussian matrix, with per-entry variance $\sigma^2$, is concentrated around $2\sqrt{n}\,\sigma$,
and thus we require $\lambda \geq 2\sqrt{n}\,\sigma$ (this provides a slightly sharper bound than Lemma 1.1). Let
$T_\lambda(X)$ denote the singular value thresholding operator given by

$$T_\lambda(X) = \sum_i \max(\sigma_i(X) - \lambda, 0)\, u_i v_i^*,$$

where $X = \sum_i \sigma_i(X) u_i v_i^*$ is any singular value decomposition. In this simple setting, the solution
to (1.3) and (1.5) can be explicitly calculated, and for $\lambda = \mu$ they are both equal to $T_\lambda(M + Z)$. If
$\lambda$ is too large, then $T_\lambda(M + Z)$ becomes strongly biased towards zero, and thus (loosely) $\lambda$ should
be as small as possible while still allowing M to be feasible for the matrix Dantzig selector (1.3),
leading to the choice $\lambda \approx 2\sqrt{n}\,\sigma$.

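Further, in this simple case we can calculate the error bound in a few lines. We have

$$\begin{aligned} \|\hat{M} - M\| &= \|T_\lambda(Y) - Y + Z\| \\ &\leq \|T_\lambda(Y) - Y\| + \|Z\| \\ &\leq 2\lambda \end{aligned}$$

assuming that $\lambda \geq \|Z\|$. Then

$$\begin{aligned} \|\hat{M} - M\|_F^2 &\leq \|\hat{M} - M\|^2 \operatorname{rank}(\hat{M} - M) \\ &\leq 4\lambda^2 \operatorname{rank}(\hat{M} - M). \end{aligned} \tag{2.8}$$

Once again, assuming that $\lambda \geq \|Z\|$, we have $\operatorname{rank}(\hat{M} - M) \leq \operatorname{rank}(\hat{M}) + \operatorname{rank}(M) \leq 2r$. Plugging
this in with $\lambda = C\sqrt{n}\,\sigma$ gives the error bound (2.6).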
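
The identity-operator calculation can be reproduced in a few lines. The sketch below assumes NumPy; the helper name `svt` and the sizes are illustrative, and the threshold is set slightly above the realized operator norm of the noise so that the feasibility assumption holds by construction. It checks the operator-norm bound, the rank of the thresholded estimate, and the resulting Frobenius bound of 8 r times the squared threshold (that is, 4 times the squared threshold times a rank of at most 2r).

```python
import numpy as np

def svt(Y, lam):
    """Singular value thresholding operator T_lam(Y)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(0)
n, r, sigma = 50, 3, 0.1
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # rank-r signal
Z = sigma * rng.standard_normal((n, n))                          # Gaussian noise
Y = M + Z

lam = 1.01 * np.linalg.norm(Z, 2)        # threshold just above ||Z||
M_hat = svt(Y, lam)

op_err = np.linalg.norm(M_hat - M, 2)
fro_err_sq = np.linalg.norm(M_hat - M, "fro") ** 2
# rank(M_hat) = number of singular values of Y that survive the threshold
rank_hat = int(np.sum(np.linalg.svd(Y, compute_uv=False) > lam))
```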

3 Of course if one sees all of the entries of the matrix plus noise, nuclear-norm minimization is unnecessary, and
one can achieve minimax error bounds by truncating the singular values.

2.3 Oracle inequalities

Showing that an estimator achieves the minimax risk is reassuring but is sometimes not considered 
completely satisfactory. As is frequently discussed in the literature, the minimax approach focuses 
on the worst-case performance and it is quite reasonable to expect that for matrices of general inter- 
est, better performances are possible. In fact, a recent trend in statistical estimation is to compare 
the performance of an estimator with what is achievable with the help of an oracle that reveals 
extra information about the problem. A good match indicates an overall excellent performance. 

To develop an oracle bound, assume w.l.o.g. that $n_2 \geq n_1$ so that $n = n_2$, and consider the
family of estimators defined as follows: for each $n_1 \times r$ orthogonal matrix U, define

$$\hat{M}[U] = \operatorname{argmin}\{\|y - \mathcal{A}(\hat{M})\|_{\ell_2} : \hat{M} = UR \text{ for some } R\}. \tag{2.9}$$

In other words, we fix the column space (the linear space spanned by the columns of the matrix
U), and then find the matrix with that column space which best fits the data. Knowing the true
matrix M, an oracle or a genie would then select the best column space to use so as to minimize the
mean-squared error (MSE)

$$\inf_U \mathbb{E}\|M - \hat{M}[U]\|_F^2. \tag{2.10}$$

The question is whether it is possible to mimic the performance of the oracle and achieve a MSE
close to (2.10) with a real estimator.
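
The oracle estimator for a fixed column space is an ordinary least-squares problem, which the following sketch makes explicit. It assumes NumPy, represents the measurement operator as an m x (n1 n2) matrix acting on the column-stacked vec, and uses the identity vec(UR) = (I kron U) vec(R); the function name `oracle_estimate` and the sizes are illustrative. In the noiseless case, with the true column space supplied and enough measurements, the estimate reproduces M exactly.

```python
import numpy as np

def oracle_estimate(A, y, U, n2):
    """Least squares over matrices of the form U R (A acts on column-stacked vec)."""
    n1, r = U.shape
    # vec(U R) = (I_{n2} kron U) vec(R) for column-stacking vec.
    design = A @ np.kron(np.eye(n2), U)          # shape (m, r * n2)
    vec_R, *_ = np.linalg.lstsq(design, y, rcond=None)
    R = vec_R.reshape((r, n2), order="F")
    return U @ R

rng = np.random.default_rng(0)
n1, n2, r, m = 6, 8, 2, 40                        # m >= r * n2 unknowns
U, _ = np.linalg.qr(rng.standard_normal((n1, r)))
M = U @ rng.standard_normal((r, n2))              # true matrix with column space span(U)
A = rng.standard_normal((m, n1 * n2)) / np.sqrt(m)
y = A @ M.ravel(order="F")                        # noiseless measurements
M_hat = oracle_estimate(A, y, U, n2)
```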

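Before giving a precise answer to this question, it is useful to determine how large the oracle
risk is. To this end, consider a fixed orthogonal matrix U, and write the least-squares estimate
(2.9) as

$$\hat{M}[U] := U H_U(y), \qquad H_U = (\mathcal{A}_U^* \mathcal{A}_U)^{-1} \mathcal{A}_U^*,$$

where $\mathcal{A}_U$ is the linear map

$$\mathcal{A}_U : \mathbb{R}^{r \times n} \to \mathbb{R}^m, \qquad R \mapsto \mathcal{A}(UR),$$

and

$$\mathcal{A}_U^* : \mathbb{R}^m \to \mathbb{R}^{r \times n}, \qquad y \mapsto U^* \mathcal{A}^*(y). \tag{2.11}$$

Then decompose the MSE as the sum of the squared bias and variance

$$\begin{aligned} \mathbb{E}\|M - \hat{M}[U]\|_F^2 &= \|\text{bias}\|^2 + \text{variance} \\ &= \|\mathbb{E}\hat{M}[U] - M\|_F^2 + \mathbb{E}\|U H_U(z)\|_F^2. \end{aligned}$$

The variance term is classically equal to

$$\mathbb{E}\|U H_U(z)\|_F^2 = \mathbb{E}\|H_U(z)\|_F^2 = \sigma^2 \operatorname{trace}(H_U^* H_U) = \sigma^2 \operatorname{trace}((\mathcal{A}_U^* \mathcal{A}_U)^{-1}).$$

Due to the restricted isometry property, all the eigenvalues of the linear operator $\mathcal{A}_U^* \mathcal{A}_U$ belong to
the interval $[1 - \delta_r, 1 + \delta_r]$, see Lemma 3.12. Therefore, the variance term obeys

$$\sigma^2 \operatorname{trace}((\mathcal{A}_U^* \mathcal{A}_U)^{-1}) \geq \frac{1}{1 + \delta_r}\, n r \sigma^2.$$

For the bias term, we have

$$\mathbb{E}\hat{M}[U] - M = U(\mathcal{A}_U^* \mathcal{A}_U)^{-1} \mathcal{A}_U^* \mathcal{A}(M) - M,$$

which we rewrite as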

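$$\begin{aligned} \mathbb{E}\hat{M}[U] - M &= U(\mathcal{A}_U^* \mathcal{A}_U)^{-1} \mathcal{A}_U^* \mathcal{A}((I - UU^* + UU^*)M) - M \\ &= U(\mathcal{A}_U^* \mathcal{A}_U)^{-1} \mathcal{A}_U^* \mathcal{A}((I - UU^*)M) + U(\mathcal{A}_U^* \mathcal{A}_U)^{-1} \mathcal{A}_U^* \mathcal{A}_U(U^*M) - M \\ &= U(\mathcal{A}_U^* \mathcal{A}_U)^{-1} \mathcal{A}_U^* \mathcal{A}((I - UU^*)M) - (I - UU^*)M. \end{aligned}$$

Hence, the bias is the sum of two matrices: the first has a column space included in the span of
the columns of U while the column space of the other is orthogonal to this span. Put $P_{U^\perp}(M) =
(I - UU^*)M$; that is, $P_{U^\perp}$ is the (left) multiplication with the orthogonal projection matrix
$(I - UU^*)$. We have

$$\begin{aligned} \|\mathbb{E}\hat{M}[U] - M\|_F^2 &= \|(\mathcal{A}_U^* \mathcal{A}_U)^{-1} \mathcal{A}_U^* \mathcal{A}(P_{U^\perp}(M))\|_F^2 + \|P_{U^\perp}(M)\|_F^2 \\ &\geq \|P_{U^\perp}(M)\|_F^2. \end{aligned}$$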

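To summarize, the oracle bound obeys

$$\inf_U \mathbb{E}\|M - \hat{M}[U]\|_F^2 \geq \inf_U \left[\|P_{U^\perp}(M)\|_F^2 + \frac{nr\sigma^2}{1 + \delta_r}\right].$$

Now for a given dimension r, the best U - that minimizing the squared bias term or its proxy
$\|P_{U^\perp}(M)\|_F^2$ - spans the top r singular vectors of the matrix M. Denoting the singular values of
M by $\sigma_i(M)$, we obtain

$$\inf_U \mathbb{E}\|M - \hat{M}[U]\|_F^2 \geq \inf_r \left[\sum_{i > r} \sigma_i^2(M) + \frac{nr\sigma^2}{1 + \delta_r}\right],$$

which for convenience we simplify to

$$\inf_U \mathbb{E}\|M - \hat{M}[U]\|_F^2 \geq \frac{1}{2}\sum_i \min(\sigma_i^2(M), n\sigma^2). \tag{2.12}$$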



The right-hand side has a nice interpretation. Write the SVD of M as $M = \sum_{i=1}^r \sigma_i(M) u_i v_i^*$.
Then if $\sigma_i^2(M) > n\sigma^2$, one should try to estimate the rank-1 contribution $\sigma_i(M) u_i v_i^*$ and pay the
variance term (which is about $n\sigma^2$) whereas if $\sigma_i^2(M) \leq n\sigma^2$, we should not try to estimate this
component, and pay a squared bias term equal to $\sigma_i^2(M)$. In other words, the right-hand side may
be interpreted as an ideal bias-variance trade-off.
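
The trade-off can be made tangible with a tiny computation. The sketch below assumes NumPy; the function names and the example singular values are illustrative, and the factor involving the isometry constant is ignored for simplicity. It shows that the sum of min(singular value squared, n times the noise variance) coincides with the best achievable "tail energy plus n r times the noise variance" over all cutoff ranks r, the minimizing r being the number of singular values standing above the noise level.

```python
import numpy as np

def ideal_tradeoff(sing_vals, n, sigma):
    """Sum over i of min(sigma_i(M)^2, n * sigma^2)."""
    s2 = np.asarray(sing_vals, dtype=float) ** 2
    return float(np.sum(np.minimum(s2, n * sigma**2)))

def best_rank_tradeoff(sing_vals, n, sigma):
    """Minimise over r: (sum of sigma_i^2 for i > r) + n * r * sigma^2."""
    s2 = np.sort(np.asarray(sing_vals, dtype=float) ** 2)[::-1]
    # tails[r] = sum of the squared singular values with index > r (1-based)
    tails = np.concatenate([np.cumsum(s2[::-1])[::-1], [0.0]])
    costs = tails + n * sigma**2 * np.arange(len(s2) + 1)
    r_best = int(np.argmin(costs))
    return r_best, float(costs[r_best])

sing_vals = [10.0, 5.0, 2.0, 0.5, 0.1]
n, sigma = 20, 0.5                      # noise level n * sigma^2 = 5
r_best, cost = best_rank_tradeoff(sing_vals, n, sigma)
ideal = ideal_tradeoff(sing_vals, n, sigma)
```

Here only the two singular values whose squares exceed 5 are worth estimating; the remaining three are cheaper to leave as bias.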

The main result of this section is that the matrix Dantzig Selector and matrix Lasso achieve 
this same ideal bias-variance trade-off to within a constant.

Theorem 2.6 Assume that $\operatorname{rank}(M) \leq r$ and let $\hat{M}_{DS}$ be the solution to the matrix Dantzig
selector (1.3) and $\hat{M}_L$ be the solution to the matrix Lasso (1.5). Suppose z is a Gaussian error and
let $\lambda = 16\sqrt{n}\,\sigma$ and $\mu = 32\sqrt{n}\,\sigma$. If $\delta_{4r} < \sqrt{2} - 1$, then

$$\|\hat{M}_{DS} - M\|_F^2 \leq C_0 \sum_i \min(\sigma_i^2(M), n\sigma^2) \tag{2.13}$$

and if $\delta_{4r} < (3\sqrt{2} - 1)/17$, then

$$\|\hat{M}_L - M\|_F^2 \leq C_1 \sum_i \min(\sigma_i^2(M), n\sigma^2) \tag{2.14}$$

with probability at least $1 - 2e^{-cn}$ for constants $C_0$ and $C_1$ (depending only on $\delta_{4r}$).

In other words, not only does nuclear-norm minimization mimic the performance that one would
achieve with an oracle that gives the exact column space of M (as in Theorem 2.5), but in fact the
error bound is within a constant of what one would achieve by projecting onto the optimal column
space corresponding only to the significant singular values.

While a similar result holds in the compressive sensing literature [7], we derive the result here
using a novel technique. We use a middle estimate which is the optimal solution to a certain
rank-minimization problem (see Section 3) and is provably near both M and $\hat{M}$. With this technique,
the proof is a fairly simple extension of Theorem 2.4.

2.4 Extension to full-rank matrices 

In some applications, such as sensor localization, M has exactly low rank, i.e. only the top few of 
its singular values are nonzero. However, in many applications, such as quantum state tomography, 
M has full rank, but is well approximated by a low-rank matrix. In this section, we demonstrate 
an extension of the preceding error bound when M has full rank. 
First, suppose m < ri2 and note that a result of the form 

m 

||M-M||| < C^min(cr 2 (M),ncT 2 ) (2.15) 

i=l 

would be impossible when undersampling M because it would imply that as the noise level a 
approaches zero, an arbitrary full-rank n x n matrix could be exactly reconstructed from fewer 
than n 2 linear measurements. Instead, our result essentially splits M into two parts, 

$$M = \sum_{i=1}^{\bar r} \sigma_i(M) u_i v_i^* + \sum_{i=\bar r+1}^{n} \sigma_i(M) u_i v_i^* = M_{\bar r} + M_c,$$

where $\bar r \approx m/n$, and $M_{\bar r}$ is the best rank-$\bar r$ approximation to M. The error bound in the theorem
reflects a near-optimal bias-variance trade-off in recovering $M_{\bar r}$, but an inability to recover $M_c$ (and
indeed the proof essentially considers $M_c$ as non-Gaussian noise). Note that $\bar r(n_1 + n_2 - \bar r)$ is of
the same order as m so that the part of the matrix which is well recovered has about as many
degrees of freedom as the number of measurements. In other words, even in the noiseless case 
this theorem demonstrates instance optimality i.e. the error bound is proportional to the norm of 
the part of M that is irrecoverable given the number of measurements (see [25] for an analogous 
result in compressive sensing). In the noisy case there does not seem to be any current analogue 
to this error bound in compressive sensing, although the detailed analysis can be translated to the 
compressive sensing problem and the authors are currently writing a short paper containing this 
result. 

Theorem 2.7 Fix M. Suppose that $\mathcal{A}$ is sampled from the Gaussian measurement ensemble with
$m \le c_0 n^2/\log(m/n)$ and let $\bar r \le c_1 m/n$ for some fixed numerical constants $c_0$ and $c_1$. Let $\hat M$ be the
solution to the matrix Dantzig selector (1.3) with $\lambda = 16\sqrt{n}\,\sigma$ or the solution to the matrix Lasso
(1.5) with $\mu = 32\sqrt{n}\,\sigma$. Then

$$\|\hat M - M\|_F^2 \le C\left(\sum_{i=1}^{\bar r} \min(\sigma_i^2(M), n\sigma^2) + \sum_{i=\bar r+1}^{n} \sigma_i^2(M)\right) \qquad (2.16)$$

with probability greater than $1 - De^{-dn}$ for fixed numerical constants $C, D, d > 0$. Roughly, the
same conclusion extends to operators obeying the NNQ condition, see below.
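
To make the shape of the bound (2.16) concrete, the following sketch evaluates its right-hand side (without the unspecified constant C) for a given list of singular values. The function name `full_rank_bound` and the example spectrum are illustrative assumptions and do not come from the paper.

```python
def full_rank_bound(singular_values, n, sigma, r_bar):
    """Right-hand side of (2.16), omitting the constant C.

    singular_values: singular values of M, in any order.
    n: matrix dimension; sigma: noise standard deviation.
    r_bar: number of leading singular values treated as recoverable.
    """
    s = sorted(singular_values, reverse=True)
    noise_floor = n * sigma ** 2
    # Bias-variance part: each leading singular value costs at most n*sigma^2.
    head = sum(min(x * x, noise_floor) for x in s[:r_bar])
    # Tail part: energy of the singular values that cannot be recovered.
    tail = sum(x * x for x in s[r_bar:])
    return head + tail


if __name__ == "__main__":
    # Hypothetical geometrically decaying spectrum, chosen only for illustration.
    n, sigma = 50, 0.01
    spectrum = [0.5 ** i for i in range(n)]
    for r_bar in (1, 5, 10):
        print(r_bar, full_rank_bound(spectrum, n, sigma, r_bar))
```

With zero noise the first sum vanishes and only the tail energy remains, which is the instance-optimality reading of the theorem.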

An interesting note is that in the noiseless case this error bound provides a case of 'instance
optimality'.

First note that $\bar r$ is small enough so that the RIP holds with high probability (see Theorem
2.3). However, the theorem requires more than just the RIP. The other main requirement is a
certain NNQ condition, which holds for Gaussian measurement ensembles and is introduced in
Section 3.6. It is an analogous requirement to the LQ condition introduced by Wojtaszczyk [25] in
compressive sensing. To keep the presentation of the theorem simple, we defer the explanation of
the NNQ condition to the proofs section and simply state the theorem for the Gaussian measurement
ensemble. However, the proof is not sensitive to the use of this ensemble (for example, sub-Gaussian
measurements yield the same result). Many generalizations of this theorem are available and the
lemmas necessary to make such generalizations are spelled out in Section 3.6.

The assumption that $m \le c n^2/\log(m/n)$ seems to be an artifact of the proof technique. Indeed,
one would not expect further measurements to negatively impact performance. In fact, when
$m \ge c'n^2$ for a fixed constant $c'$, one can use Lemma 3.2 from Section 3 to derive the error
bound (2.16) (with high probability), leaving the necessity for a small 'patch' in the theory when
$c n^2/\log(m/n) < m < c'n^2$. However, our results intend to address the situation in which M is
significantly undersampled, i.e. $m \ll n^2$, so the requirement that $m \le cn^2/\log(m/n)$ should be
intrinsic to the problem setup.



3 Proofs 

The proofs of several of the theorems use $\epsilon$-nets. For a set $S$, an $\epsilon$-net $\bar S_\epsilon$ with respect to a norm
$\|\cdot\|$ satisfies the following property: for any $v \in S$, there exists $v_0 \in \bar S_\epsilon$ with $\|v_0 - v\| \le \epsilon$. In other
words, $\bar S_\epsilon$ approximates $S$ to within distance $\epsilon$ with respect to the norm $\|\cdot\|$. As shown in [24],
there always exists an $\epsilon$-net $\bar S_\epsilon$ satisfying $\bar S_\epsilon \subset S$ and

$$|\bar S_\epsilon| \le \frac{\mathrm{Vol}(S + \tfrac{1}{2}D)}{\mathrm{Vol}(\tfrac{1}{2}D)},$$

where $\tfrac{1}{2}D$ is an $\epsilon/2$ ball (with respect to the norm $\|\cdot\|$) and $S + \tfrac{1}{2}D = \{x + y : x \in S, y \in \tfrac{1}{2}D\}$. In
particular, if $S$ is a unit ball in n dimensions (with respect to the norm $\|\cdot\|$) or if it is the surface
of the unit ball or any other subset of the unit ball, then $S + \tfrac{1}{2}D$ is contained in the $1 + \epsilon/2$ ball,
and thus

$$|\bar S_\epsilon| \le \left(\frac{1 + \epsilon/2}{\epsilon/2}\right)^n \le (3/\epsilon)^n,$$

where the last inequality follows because we always take $\epsilon \le 1$. See [21] for a more detailed
argument. We will require in all of our proofs that $\bar S_\epsilon \subset S$.
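
The volumetric bound can be checked numerically on a small example. The sketch below is an illustration rather than the construction used in [24]: it builds a maximal $\epsilon$-separated subset of sampled points on the unit sphere, which is automatically an $\epsilon$-net of those points contained in the sample, and its size can be compared with $(3/\epsilon)^n$. The helper names are hypothetical.

```python
import math
import random


def distance(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def sample_unit_sphere(dim, count, rng):
    """Draw `count` points uniformly from the unit sphere in R^dim."""
    points = []
    for _ in range(count):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in g))
        points.append(tuple(x / norm for x in g))
    return points


def greedy_net(points, eps):
    """Return a maximal eps-separated subset of `points`.

    Every rejected point lies within eps of a kept point, so the result
    is an eps-net of `points` that is also a subset of `points`.
    """
    net = []
    for p in points:
        if all(distance(p, q) > eps for q in net):
            net.append(p)
    return net


if __name__ == "__main__":
    rng = random.Random(0)
    dim, eps = 3, 0.5
    pts = sample_unit_sphere(dim, 2000, rng)
    net = greedy_net(pts, eps)
    print(len(net), "net points; volumetric bound:", (3 / eps) ** dim)
```

Because the net points are more than $\epsilon$ apart, the $\epsilon/2$ balls around them are disjoint and sit inside the $1 + \epsilon/2$ ball, which is the same volume comparison used in the text.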



3.1 Proof of Lemma 1.1

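We assume that $\sigma = 1$ without loss of generality. Put $Z = \mathcal{A}^*(z)$. The norm of $Z$ is given by

$$\|Z\| = \sup \langle w, Zv\rangle,$$

where the supremum is taken over all pairs of vectors on the unit sphere $S^{n-1}$. Consider a 1/4-net
$\mathcal{N}_{1/4}$ of $S^{n-1}$ with $|\mathcal{N}_{1/4}| \le 12^n$. For each $v, w \in S^{n-1}$,

$$\langle w, Zv\rangle = \langle w - w_0, Zv\rangle + \langle w_0, Z(v - v_0)\rangle + \langle w_0, Zv_0\rangle \le \|Z\|\,\|w - w_0\|_{\ell_2} + \|Z\|\,\|v - v_0\|_{\ell_2} + \langle w_0, Zv_0\rangle$$

for some $v_0, w_0 \in \mathcal{N}_{1/4}$ obeying $\|v - v_0\|_{\ell_2} \le 1/4$, $\|w - w_0\|_{\ell_2} \le 1/4$. Hence,

$$\|Z\| \le 2 \sup_{v_0, w_0 \in \mathcal{N}_{1/4}} \langle w_0, Zv_0\rangle.$$

Now for a fixed pair $(v_0, w_0)$,

$$\langle w_0, Zv_0\rangle = \mathrm{trace}(w_0^* \mathcal{A}^*(z) v_0) = \mathrm{trace}(v_0 w_0^* \mathcal{A}^*(z)) = \langle w_0 v_0^*, \mathcal{A}^*(z)\rangle = \langle \mathcal{A}(w_0 v_0^*), z\rangle.$$

We deduce from this that $\langle w_0, Zv_0\rangle \sim \mathcal{N}(0, \|\mathcal{A}(w_0 v_0^*)\|_{\ell_2}^2)$. Now

$$\|\mathcal{A}(w_0 v_0^*)\|_{\ell_2}^2 \le (1 + \delta_1)\|w_0 v_0^*\|_F^2 = 1 + \delta_1,$$

so that by a standard tail bound for Gaussian random variables

$$P(|\langle w_0, Zv_0\rangle| \ge \lambda) \le 2e^{-\frac{1}{2}\frac{\lambda^2}{1+\delta_1}}.$$

Therefore,

$$P\Big(\max_{v_0, w_0 \in \mathcal{N}_{1/4}} |\langle w_0, Zv_0\rangle| \ge \gamma\sqrt{(1 + \delta_1)n}\Big) \le 2|\mathcal{N}_{1/4}|^2 e^{-\gamma^2 n/2} \le 2e^{2n\log 12 - \gamma^2 n/2},$$

which is bounded by $2e^{-cn}$ with $c = \gamma^2/2 - 2\log 12$ (we require $\gamma > 2\sqrt{\log 12}$ so that $c > 0$).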
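
As a numerical sanity check of the scaling in Lemma 1.1, the sketch below draws a small Gaussian measurement ensemble and Gaussian noise, forms $\mathcal{A}^*(z) = \sum_i z_i A_i$, and estimates its operator norm by power iteration. It assumes the normalization in which each $A_i$ has i.i.d. $\mathcal{N}(0, 1/m)$ entries; that normalization and all function names are assumptions made for this illustration.

```python
import math
import random


def adjoint_apply(A, z):
    """Compute A^*(z) = sum_i z_i A_i for a list A of n x n matrices."""
    n = len(A[0])
    Z = [[0.0] * n for _ in range(n)]
    for zi, Ai in zip(z, A):
        for r in range(n):
            row, src = Z[r], Ai[r]
            for c in range(n):
                row[c] += zi * src[c]
    return Z


def spectral_norm(Z, iters=200):
    """Estimate the operator norm of Z by power iteration on Z^T Z."""
    n = len(Z)
    v = [1.0 / math.sqrt(n)] * n
    norm = 0.0
    for _ in range(iters):
        w = [sum(Z[r][c] * v[c] for c in range(n)) for r in range(n)]  # w = Z v
        norm = math.sqrt(sum(x * x for x in w))                        # ||Z v||, v unit
        u = [sum(Z[r][c] * w[r] for r in range(n)) for c in range(n)]  # u = Z^T w
        size = math.sqrt(sum(x * x for x in u))
        if size == 0.0:
            break
        v = [x / size for x in u]
    return norm


def noise_correlation_norm(n, m, sigma, seed=0):
    """Return an estimate of ||A^*(z)|| for one random draw of A and z."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(m)  # assumed normalization: entries are N(0, 1/m)
    A = [[[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(n)]
         for _ in range(m)]
    z = [rng.gauss(0.0, sigma) for _ in range(m)]
    return spectral_norm(adjoint_apply(A, z))


if __name__ == "__main__":
    n, m, sigma = 12, 60, 1.0
    print("||A^*(z)|| estimate:", noise_correlation_norm(n, m, sigma))
    print("8 * sqrt(n) * sigma :", 8 * math.sqrt(n) * sigma)
```

Under this normalization one expects the norm to be on the order of $\sqrt{n}\,\sigma$, comfortably below $\lambda/2 = 8\sqrt{n}\,\sigma$ for the choice of $\lambda$ in Theorem 2.7.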

3.2 Proof of Theorem 2.3

The proof uses a covering argument, starting with the following lemma. 

Lemma 3.1 (Covering number for low-rank matrices) Let $S_r = \{X \in \mathbb{R}^{n_1 \times n_2} : \mathrm{rank}(X) \le r, \|X\|_F = 1\}$. Then there exists an $\epsilon$-net $\bar S_r \subset S_r$ for the Frobenius norm obeying

$$|\bar S_r| \le (9/\epsilon)^{(n_1 + n_2 + 1)r}.$$

Proof Recall the SVD $X = U\Sigma V^*$ of any $X \in S_r$ obeying $\|\Sigma\|_F = 1$. Our argument constructs
an $\epsilon$-net for $S_r$ by covering the set of permissible $U$, $V$ and $\Sigma$. We work in the simpler case where
$n_1 = n_2 = n$ since the general case is a straightforward modification.

Let $D$ be the set of diagonal matrices with nonnegative diagonal entries and Frobenius norm
equal to one. We take $\bar D$ to be an $\epsilon/3$-net for $D$ with $|\bar D| \le (9/\epsilon)^r$. Next, let $O_{n,r} = \{U \in \mathbb{R}^{n \times r} : U^*U = I\}$. To cover $O_{n,r}$, it is beneficial to use the $\|\cdot\|_{1,2}$ norm defined as

$$\|X\|_{1,2} = \max_i \|X_i\|_{\ell_2},$$

where $X_i$ denotes the ith column of $X$. Let $Q_{n,r} = \{X \in \mathbb{R}^{n \times r} : \|X\|_{1,2} \le 1\}$. It is easy to see that
$O_{n,r} \subset Q_{n,r}$ since the columns of an orthogonal matrix are unit normed. We have seen that there is
an $\epsilon/3$-net $\bar O_{n,r}$ for $O_{n,r}$ obeying $|\bar O_{n,r}| \le (9/\epsilon)^{nr}$. We now let $\bar S_r = \{\bar U \bar\Sigma \bar V^* : \bar U, \bar V \in \bar O_{n,r}, \bar\Sigma \in \bar D\}$,
and remark that $|\bar S_r| \le |\bar O_{n,r}|^2 |\bar D| \le (9/\epsilon)^{(2n+1)r}$. It remains to show that for all $X \in S_r$ there
exists $\bar X \in \bar S_r$ with $\|X - \bar X\|_F \le \epsilon$.

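Fix $X \in S_r$ and decompose $X$ as $X = U\Sigma V^*$ as above. Then there exists $\bar X = \bar U \bar\Sigma \bar V^* \in \bar S_r$ with
$\bar U, \bar V \in \bar O_{n,r}$, $\bar\Sigma \in \bar D$ obeying $\|U - \bar U\|_{1,2} \le \epsilon/3$, $\|V - \bar V\|_{1,2} \le \epsilon/3$, and $\|\Sigma - \bar\Sigma\|_F \le \epsilon/3$. This gives

$$\|X - \bar X\|_F = \|U\Sigma V^* - \bar U \bar\Sigma \bar V^*\|_F$$
$$= \|U\Sigma V^* - \bar U\Sigma V^* + \bar U\Sigma V^* - \bar U\bar\Sigma V^* + \bar U\bar\Sigma V^* - \bar U\bar\Sigma\bar V^*\|_F$$
$$\le \|(U - \bar U)\Sigma V^*\|_F + \|\bar U(\Sigma - \bar\Sigma)V^*\|_F + \|\bar U\bar\Sigma(V - \bar V)^*\|_F. \qquad (3.1)$$

For the first term, note that since $V$ is an orthogonal matrix, $\|(U - \bar U)\Sigma V^*\|_F = \|(U - \bar U)\Sigma\|_F$,
and

$$\|(U - \bar U)\Sigma\|_F^2 = \sum_{1 \le i \le r} \Sigma_{ii}^2 \|U_i - \bar U_i\|_{\ell_2}^2 \le \|\Sigma\|_F^2 \|U - \bar U\|_{1,2}^2 \le (\epsilon/3)^2.$$

Hence, $\|(U - \bar U)\Sigma V^*\|_F \le \epsilon/3$. The same argument gives $\|\bar U\bar\Sigma(V - \bar V)^*\|_F \le \epsilon/3$. To bound the
middle term, observe that $\|\bar U(\Sigma - \bar\Sigma)V^*\|_F = \|\Sigma - \bar\Sigma\|_F \le \epsilon/3$. This completes the proof. ■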

We now prove Theorem 2.3. It is a standard argument from this point, and is essentially the
same as the proof of Lemma 4.3 in [22], but we repeat it here to keep the paper self-contained.
We begin by showing that $\mathcal{A}$ is an approximate isometry on the covering set $\bar S_r$. Lemma 3.1 with
$\epsilon = \delta/(4\sqrt{2})$ gives

$$|\bar S_r| \le (36\sqrt{2}/\delta)^{(n_1 + n_2 + 1)r}. \qquad (3.2)$$

Then it follows from (2.2) together with the union bound that

$$P\Big(\max_{\bar X \in \bar S_r} \big|\|\mathcal{A}(\bar X)\|_{\ell_2}^2 - \|\bar X\|_F^2\big| > \delta/2\Big) \le |\bar S_r|\, C e^{-cm}$$
$$\le (36\sqrt{2}/\delta)^{(n_1 + n_2 + 1)r}\, C e^{-cm}$$
$$= C \exp\big((n_1 + n_2 + 1)r \log(36\sqrt{2}/\delta) - cm\big)$$
$$\le C \exp(-dm),$$

where $d = c - \log(36\sqrt{2}/\delta)/C$ and we plugged in both requirements $m \ge C(n_1 + n_2 + 1)r$ and $C >
\log(36\sqrt{2}/\delta)/c$.

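Now suppose that

$$\max_{\bar X \in \bar S_r} \big|\|\mathcal{A}(\bar X)\|_{\ell_2}^2 - \|\bar X\|_F^2\big| \le \delta/2$$

(which occurs with probability at least $1 - C\exp(-dm)$). We begin by showing that the upper
bound in the RIP condition holds. Set

$$\kappa_r = \sup_{X \in S_r} \|\mathcal{A}(X)\|_{\ell_2}.$$

For any $X \in S_r$, there exists $\bar X \in \bar S_r$ with $\|X - \bar X\|_F \le \delta/(4\sqrt{2})$ and, therefore,

$$\|\mathcal{A}(X)\|_{\ell_2} \le \|\mathcal{A}(X - \bar X)\|_{\ell_2} + \|\mathcal{A}(\bar X)\|_{\ell_2} \le \|\mathcal{A}(X - \bar X)\|_{\ell_2} + 1 + \delta/2. \qquad (3.3)$$

Put $\Delta X = X - \bar X$ and note that $\mathrm{rank}(\Delta X) \le 2r$. Write $\Delta X = \Delta X_1 + \Delta X_2$, where $\langle \Delta X_1, \Delta X_2\rangle =
0$, and $\mathrm{rank}(\Delta X_i) \le r$, $i = 1, 2$ (for example by splitting the SVD). Note that $\Delta X_1/\|\Delta X_1\|_F$,
$\Delta X_2/\|\Delta X_2\|_F \in S_r$ and, thus,

$$\|\mathcal{A}(\Delta X)\|_{\ell_2} \le \|\mathcal{A}(\Delta X_1)\|_{\ell_2} + \|\mathcal{A}(\Delta X_2)\|_{\ell_2} \le \kappa_r(\|\Delta X_1\|_F + \|\Delta X_2\|_F). \qquad (3.4)$$

Now $\|\Delta X_1\|_F + \|\Delta X_2\|_F \le \sqrt{2}\|\Delta X\|_F$, which follows from $\|\Delta X_1\|_F^2 + \|\Delta X_2\|_F^2 = \|\Delta X\|_F^2$. Also,
$\|\Delta X\|_F \le \delta/(4\sqrt{2})$, leading to $\|\mathcal{A}(\Delta X)\|_{\ell_2} \le \kappa_r\delta/4$. Plugging this into (3.3) gives

$$\|\mathcal{A}(X)\|_{\ell_2} \le \kappa_r\delta/4 + 1 + \delta/2.$$

Since this holds for all $X \in S_r$, we have $\kappa_r \le \kappa_r\delta/4 + 1 + \delta/2$ and thus $\kappa_r \le (1 + \delta/2)/(1 - \delta/4) \le 1 + \delta$,
which essentially completes the upper bound. Now that this is established, the lower bound
follows from

$$\|\mathcal{A}(X)\|_{\ell_2} \ge \|\mathcal{A}(\bar X)\|_{\ell_2} - \|\mathcal{A}(\Delta X)\|_{\ell_2} \ge 1 - \delta/2 - (1 + \delta)\sqrt{2}\,\delta/(4\sqrt{2}) \ge 1 - \delta.$$

Note that we have shown

$$(1 - \delta)\|X\|_F \le \|\mathcal{A}(X)\|_{\ell_2} \le (1 + \delta)\|X\|_F,$$

which can then be easily translated into the desired version of the RIP bound.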
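
The near-isometry established above can be probed empirically. The sketch below fixes one Gaussian map with i.i.d. $\mathcal{N}(0, 1/m)$ entries (an assumed normalization) and reports $\|\mathcal{A}(X)\|_{\ell_2}^2$ for several random unit-norm matrices of rank at most r. Sampling a few matrices does not certify the RIP, which is a statement about the supremum over all low-rank matrices; it only illustrates the concentration that the covering argument makes uniform. All names are hypothetical.

```python
import math
import random


def random_low_rank(n, r, rng):
    """Return an n x n matrix of rank at most r with unit Frobenius norm."""
    U = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(n)]
    V = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(n)]
    X = [[sum(U[i][k] * V[j][k] for k in range(r)) for j in range(n)]
         for i in range(n)]
    fro = math.sqrt(sum(x * x for row in X for x in row))
    return [[x / fro for x in row] for row in X]


def gaussian_map(n, m, rng):
    """Return m flattened n x n measurement matrices with N(0, 1/m) entries."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, scale) for _ in range(n * n)] for _ in range(m)]


def energy(A, X):
    """Return ||A(X)||^2, where A(X)_i is the inner product <A_i, X>."""
    flat = [x for row in X for x in row]
    return sum(sum(a * x for a, x in zip(row, flat)) ** 2 for row in A)


def isometry_ratios(n=6, r=1, m=300, trials=20, seed=0):
    """Energies ||A(X)||^2 for random unit-norm low-rank X and one fixed A."""
    rng = random.Random(seed)
    A = gaussian_map(n, m, rng)
    return [energy(A, random_low_rank(n, r, rng)) for _ in range(trials)]


if __name__ == "__main__":
    ratios = isometry_ratios()
    print("min ratio:", min(ratios), "max ratio:", max(ratios))
```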

3.3 Proof of Theorem Q 

We prove Theorems 2.4, 2.6 and 2.7 for the matrix Dantzig selector (1.3) and describe in Section
3.7 how to extend these proofs to the matrix Lasso. We also assume that we are dealing with square
matrices from this point forward ($n = n_1 = n_2$) for notational simplicity; the generalizations of the
proofs to rectangular matrices are straightforward.

We begin with a lemma, which applies to full-rank matrices, and contains Theorem 2.4 as a special
case.^4

Lemma 3.2 Suppose $\delta_{4r} < \sqrt{2} - 1$ and let $M_r$ be any rank-r matrix. Let $M_c = M - M_r$. Suppose
$\lambda$ obeys $\|\mathcal{A}^*(z)\| \le \lambda$. Then the solution $\hat M$ to (1.3) obeys

$$\|\hat M - M\|_F \le C_0\sqrt{r}\,\lambda + C_1\|M_c\|_*/\sqrt{r}, \qquad (3.5)$$

where $C_0$ and $C_1$ are small constants depending only on the isometry constant $\delta_{4r}$.

We shall use the fact that A maps low-rank orthogonal matrices to approximately orthogonal 
vectors. 

Lemma 3.3 Jjf For all X, X' obeying (X,X') = 0, and rank(X) < r, rank(X') < r' , 

$$|\langle \mathcal{A}(X), \mathcal{A}(X')\rangle| \le \delta_{r+r'}\|X\|_F\|X'\|_F.$$

4 We did not present this lemma in the main portion of the paper because it does not seem to have an intuitive 
interpretation. 
Proof This is a simple application of the parallelogram identity. Suppose without loss of generality
that $X$ and $X'$ have unit Frobenius norms. Then

$$(1 - \delta_{r+r'})\|X \pm X'\|_F^2 \le \|\mathcal{A}(X \pm X')\|_{\ell_2}^2 \le (1 + \delta_{r+r'})\|X \pm X'\|_F^2,$$

since $\mathrm{rank}(X \pm X') \le r + r'$. We have $\|X \pm X'\|_F^2 = \|X\|_F^2 + \|X'\|_F^2 = 2$ and the parallelogram
identity asserts that

$$|\langle \mathcal{A}(X), \mathcal{A}(X')\rangle| = \tfrac{1}{4}\big|\|\mathcal{A}(X + X')\|_{\ell_2}^2 - \|\mathcal{A}(X - X')\|_{\ell_2}^2\big| \le \delta_{r+r'},$$

which concludes the proof. ■

The proof of Lemma 3.2 parallels that of Candes and Tao about the recovery of nearly sparse
vectors from a limited number of measurements [7]. It is also inspired by the work of Fazel, Recht,
Candes and Parrilo [13, 22]. Set $H = \hat M - M$ and observe that by the triangle inequality,

$$\|\mathcal{A}^*\mathcal{A}(H)\| \le \|\mathcal{A}^*(\mathcal{A}(\hat M) - y)\| + \|\mathcal{A}^*(y - \mathcal{A}(M))\| \le 2\lambda, \qquad (3.6)$$

since $M$ is feasible for the problem (1.3). Decompose $H$ as

$$H = H_0 + H_c,$$

where $\mathrm{rank}(H_0) \le 2r$, $M_r H_c^* = 0$ and $M_r^* H_c = 0$ (see [22]). We have

$$\|M + H\|_* \ge \|M_r + H_c\|_* - \|M_c\|_* - \|H_0\|_* = \|M_r\|_* + \|H_c\|_* - \|M_c\|_* - \|H_0\|_*.$$

Since by definition, $\|M + H\|_* \le \|M\|_* \le \|M_r\|_* + \|M_c\|_*$, this gives

$$\|H_c\|_* \le \|H_0\|_* + 2\|M_c\|_*. \qquad (3.7)$$

Next, we use a classical estimate developed in [11] (see also [22]). Let $H_c = U\,\mathrm{diag}(\vec\sigma)\,V^*$ be the
SVD of $H_c$, where $\vec\sigma$ is the list of ordered singular values (not to be confused with the noise standard
deviation). Decompose $H_c$ into a sum of matrices $H_1, H_2, \ldots$, each of rank at most 2r, as follows.
For each j define the index set $I_j = \{2r(j - 1) + 1, \ldots, 2rj\}$, and let $H_j := U_{I_j}\mathrm{diag}(\sigma_{I_j})V_{I_j}^*$; that is,
$H_1$ is the part of $H_c$ corresponding to the 2r largest singular values, $H_2$ is the part corresponding
to the next 2r largest, and so on. A now standard computation shows that

$$\sum_{j \ge 2}\|H_j\|_F \le \frac{1}{\sqrt{2r}}\|H_c\|_*, \qquad (3.8)$$

and thus

$$\sum_{j \ge 2}\|H_j\|_F \le \|H_0\|_F + \sqrt{\frac{2}{r}}\,\|M_c\|_*,$$

since $\|H_0\|_* \le \sqrt{2r}\,\|H_0\|_F$ by Cauchy-Schwarz.
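Now the restricted isometry property gives

$$(1 - \delta_{4r})\|H_0 + H_1\|_F^2 \le \|\mathcal{A}(H_0 + H_1)\|_{\ell_2}^2, \qquad (3.9)$$

and observe that

$$\|\mathcal{A}(H_0 + H_1)\|_{\ell_2}^2 = \Big\langle \mathcal{A}(H_0 + H_1), \mathcal{A}\Big(H - \sum_{j \ge 2} H_j\Big)\Big\rangle.$$

We first argue that

$$|\langle \mathcal{A}(H_0 + H_1), \mathcal{A}(H)\rangle| \le \|H_0 + H_1\|_F \sqrt{4r}\,\|\mathcal{A}^*\mathcal{A}(H)\|. \qquad (3.10)$$

To see why this is true, let $U\Sigma V^*$ be the reduced SVD of $H_0 + H_1$ in which $U$ and $V$ are $n \times r'$,
and $\Sigma$ is $r' \times r'$ with $r' = \mathrm{rank}(H_0 + H_1) \le 4r$. We have

$$\langle \mathcal{A}(H_0 + H_1), \mathcal{A}(H)\rangle = \langle H_0 + H_1, \mathcal{A}^*\mathcal{A}(H)\rangle = \langle \Sigma, U^*[\mathcal{A}^*\mathcal{A}(H)]V\rangle \le \|H_0 + H_1\|_F\,\|U^*[\mathcal{A}^*\mathcal{A}(H)]V\|_F.$$

The claim follows from $\|U^*[\mathcal{A}^*\mathcal{A}(H)]V\|_F \le \sqrt{r'}\,\|\mathcal{A}^*\mathcal{A}(H)\|$, which holds since $U^*[\mathcal{A}^*\mathcal{A}(H)]V$ is
an $r' \times r'$ matrix with spectral norm bounded by $\|\mathcal{A}^*\mathcal{A}(H)\|$. Second, Lemma 3.3 implies that for
$j \ge 2$,

$$|\langle \mathcal{A}(H_0), \mathcal{A}(H_j)\rangle| \le \delta_{4r}\|H_0\|_F\|H_j\|_F, \qquad (3.11)$$

and similarly with $H_1$ in place of $H_0$. Note that because $H_0$ is orthogonal to $H_1$, we have that
$\|H_0 + H_1\|_F^2 = \|H_0\|_F^2 + \|H_1\|_F^2$ and thus $\|H_0\|_F + \|H_1\|_F \le \sqrt{2}\,\|H_0 + H_1\|_F$. This gives

$$|\langle \mathcal{A}(H_0 + H_1), \mathcal{A}(H_j)\rangle| \le \sqrt{2}\,\delta_{4r}\|H_0 + H_1\|_F\|H_j\|_F. \qquad (3.12)$$

Taken together, (3.9), (3.10) and (3.12) yield

$$(1 - \delta_{4r})\|H_0 + H_1\|_F \le \sqrt{4r}\,\|\mathcal{A}^*\mathcal{A}(H)\| + \sqrt{2}\,\delta_{4r}\sum_{j \ge 2}\|H_j\|_F$$
$$\le \sqrt{4r}\,\|\mathcal{A}^*\mathcal{A}(H)\| + \sqrt{2}\,\delta_{4r}\|H_0\|_F + \frac{2\delta_{4r}}{\sqrt{r}}\|M_c\|_*.$$

To conclude, we have that

$$\|H_0 + H_1\|_F \le C_1\sqrt{4r}\,\|\mathcal{A}^*\mathcal{A}(H)\| + C_1\frac{2\delta_{4r}}{\sqrt{r}}\|M_c\|_*, \qquad C_1 = 1/[1 - (\sqrt{2} + 1)\delta_{4r}],$$

provided that $C_1 > 0$. Our claim (2.4) then follows from (3.6) together with

$$\|H\|_F \le \|H_0 + H_1\|_F + \sum_{j \ge 2}\|H_j\|_F \le 2\|H_0 + H_1\|_F + \sqrt{\frac{2}{r}}\,\|M_c\|_*.$$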

3.4 Proof of Theorem 2.4

Theorem 2.4 follows by simply plugging $M_r = M$ into Lemma 3.2. To generalize the results, note
that there are only two requirements on $M$, $\mathcal{A}$ and $y$ used in the proof.

• $\|\mathcal{A}^*(\mathcal{A}(M) - y)\| \le \lambda$

• $\mathrm{rank}(M) = r$ and $\delta_{4r} < \sqrt{2} - 1$.

Thus, the steps above also prove the following lemma, which is useful in proving Theorem 2.6.

Lemma 3.4 Assume that $X$ is of rank at most r and that $\delta_{4r} < \sqrt{2} - 1$. Suppose $\lambda$ obeys $\|\mathcal{A}^*(y -
\mathcal{A}(X))\| \le \lambda$. Then the solution $\hat M$ to (1.3) obeys

$$\|\hat M - X\|_F^2 \le C_0\, r\lambda^2, \qquad (3.13)$$

where $C_0$ is a small constant depending only on the isometry constant $\delta_{4r}$.

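3.5 Proof of Theorem 2.6

In this section, $\lambda = 16\sqrt{n}\,\sigma$ and we take as given that $\|\mathcal{A}^*(z)\| \le \lambda/2$ (and thus, by Lemma 1.1, the
end result holds with probability at least $1 - 2e^{-cn}$). The novelty in this proof, and the way it differs
from analogous proofs in compressive sensing, is in the use of a middle estimate $\bar M$. Define $K$ as

$$K(X; M) = \gamma\,\mathrm{rank}(X) + \|\mathcal{A}(X) - \mathcal{A}(M)\|_{\ell_2}^2, \qquad \gamma = \frac{\lambda^2}{4(1 + \delta_1)}, \qquad (3.14)$$

and let $\bar M = \mathrm{argmin}_X K(X; M)$. In words, $\bar M$ achieves a compromise between goodness of fit
and parsimony in the model with noiseless data. The factor $\gamma$ could be replaced by $\lambda^2$, but the
derivations are cleanest in the present form. We begin by bounding the distance between $M$ and
$\bar M$ using the RIP, and obtain

$$\|\bar M - M\|_F^2 \le \frac{1}{1 - \delta_{2r}}\|\mathcal{A}(\bar M) - \mathcal{A}(M)\|_{\ell_2}^2, \qquad (3.15)$$

where the use of the isometry constant $\delta_{2r}$ follows from the fact that $\mathrm{rank}(\bar M) \le \mathrm{rank}(M)$.
We now develop a bound about $\|\hat M - \bar M\|_F$. Lemma 3.5 gives

$$\|\mathcal{A}^*(y - \mathcal{A}(\bar M))\| \le \|\mathcal{A}^*(z)\| + \|\mathcal{A}^*\mathcal{A}(M - \bar M)\| \le \lambda,$$

i.e. $\bar M$ is feasible for (1.3). Also, $\mathrm{rank}(\bar M) \le \mathrm{rank}(M)$ and, thus, plugging $\bar M$ into Lemma 3.4 gives

$$\|\hat M - \bar M\|_F^2 \le C\lambda^2\,\mathrm{rank}(\bar M).$$

Combining this with (3.15) gives

$$\|\hat M - M\|_F^2 \le 2\|\hat M - \bar M\|_F^2 + 2\|\bar M - M\|_F^2$$
$$\le 2C\lambda^2\,\mathrm{rank}(\bar M) + \frac{2}{1 - \delta_{2r}}\|\mathcal{A}(\bar M) - \mathcal{A}(M)\|_{\ell_2}^2$$
$$\le C' K(\bar M; M), \qquad (3.16)$$

where $C' = \max(8C(1 + \delta_1), 2/(1 - \delta_{2r}))$.

Now $\bar M$ is the minimizer of $K(\cdot\,; M)$, and so $K(\bar M; M) \le K(M_0; M)$, where

$$M_0 = \sum_i \sigma_i(M)\,1_{\{\sigma_i(M) > \lambda\}}\, u_i v_i^*. \qquad (3.17)$$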
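We have

$$K(M_0; M) \le \gamma\sum_{i=1}^{r} 1_{\{\sigma_i(M) > \lambda\}} + \|\mathcal{A}(M - M_0)\|_{\ell_2}^2$$
$$\le \gamma\sum_{i=1}^{r} 1_{\{\sigma_i(M) > \lambda\}} + (1 + \delta_r)\|M - M_0\|_F^2$$
$$\le (1 + \delta_r)\sum_{i=1}^{r}\min(\lambda^2, \sigma_i^2(M)).$$

In conclusion, the proof follows from $\lambda = 16\sqrt{n}\,\sigma$ since

$$\|\hat M - M\|_F^2 \le C'(1 + \delta_r)\sum_{i=1}^{r}\min(\lambda^2, \sigma_i^2(M)).$$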
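
The last chain of inequalities can be checked on concrete numbers. The sketch below takes a list of singular values and a threshold $\lambda$, forms the hard-thresholded upper bound on $K(M_0; M)$ from the middle line of the display, and compares it with the oracle-type quantity $(1 + \delta_r)\sum_i \min(\lambda^2, \sigma_i^2(M))$. The function name is hypothetical and the isometry constants are treated as free parameters; this is a sketch under those assumptions.

```python
def threshold_bound_terms(singular_values, lam, delta_1=0.0, delta_r=0.0):
    """Return (middle, oracle) for the bound on K(M_0; M).

    middle = gamma * #{sigma_i > lam} + (1 + delta_r) * sum_{sigma_i <= lam} sigma_i^2
    oracle = (1 + delta_r) * sum_i min(lam^2, sigma_i^2)
    with gamma = lam^2 / (4 * (1 + delta_1)).
    """
    gamma = lam ** 2 / (4.0 * (1.0 + delta_1))
    kept = [s for s in singular_values if s > lam]        # retained in M_0
    dropped = [s for s in singular_values if s <= lam]    # zeroed out in M_0
    middle = gamma * len(kept) + (1.0 + delta_r) * sum(s * s for s in dropped)
    oracle = (1.0 + delta_r) * sum(min(lam ** 2, s * s) for s in singular_values)
    return middle, oracle


if __name__ == "__main__":
    middle, oracle = threshold_bound_terms([5, 3, 1, 0.5], lam=2)
    print("bound on K(M_0; M):", middle, "oracle quantity:", oracle)
```

Each retained singular value contributes $\gamma \le \lambda^2$ and each discarded one contributes its own energy, which is why the first quantity never exceeds the second.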

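Lemma 3.5 The minimizer $\bar M$ obeys

$$\|\mathcal{A}^*\mathcal{A}(\bar M - M)\| \le \lambda/2.$$

Proof Suppose not. Then there are unit-normed vectors $u, v \in \mathbb{R}^n$ obeying

$$\langle uv^*, \mathcal{A}^*\mathcal{A}(\bar M - M)\rangle > \lambda/2.$$

We construct the rank-1 perturbation $M' = \bar M - a\,uv^*$, $a = \langle uv^*, \mathcal{A}^*\mathcal{A}(\bar M - M)\rangle/\|\mathcal{A}(uv^*)\|_{\ell_2}^2$, and
claim that $K(M'; M) < K(\bar M; M)$, thus providing the contradiction. We have

$$\|\mathcal{A}(M' - M)\|_{\ell_2}^2 = \|\mathcal{A}(\bar M - M)\|_{\ell_2}^2 - 2a\langle \mathcal{A}(uv^*), \mathcal{A}(\bar M - M)\rangle + a^2\|\mathcal{A}(uv^*)\|_{\ell_2}^2$$
$$= \|\mathcal{A}(\bar M - M)\|_{\ell_2}^2 - a^2\|\mathcal{A}(uv^*)\|_{\ell_2}^2.$$

It then follows that

$$K(M'; M) \le \gamma(\mathrm{rank}(\bar M) + 1) + \|\mathcal{A}(\bar M - M)\|_{\ell_2}^2 - a^2\|\mathcal{A}(uv^*)\|_{\ell_2}^2$$
$$= K(\bar M; M) + \gamma - a^2\|\mathcal{A}(uv^*)\|_{\ell_2}^2.$$

However, $\|\mathcal{A}(uv^*)\|_{\ell_2}^2 \le (1 + \delta_1)\|uv^*\|_F^2 = 1 + \delta_1$ and, therefore, $a^2\|\mathcal{A}(uv^*)\|_{\ell_2}^2 > \gamma$ since $\langle uv^*, \mathcal{A}^*\mathcal{A}(\bar M - M)\rangle > \lambda/2$. ■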

3.6 Proof of Theorem 2.7

Three useful lemmas are established in the course of the proof of this more involved result, and we
would like to point out that these can be used as powerful error bounds themselves. Throughout
the proof, $C$ is a constant that may depend on $\delta_{4\bar r}$ only, and whose value may change from line to
line. An important fact to keep in mind is that under the assumptions of the theorem, $\delta_{4\bar r}$ can be
bounded, with high probability, by an arbitrarily small constant depending on the size of the scalar
$c_1$ appearing in the condition $\bar r \le c_1 m/n$. This is a consequence of Theorem 2.3. In particular,
$\delta_{4\bar r} \le (\sqrt{2} - 1)/2$ with probability at least $1 - De^{-dm}$.
Lemma 3.6 Let $\bar M$ and $M_0$ be defined via (3.14) and (3.17), and set

$$r = \max(\mathrm{rank}(\bar M), \mathrm{rank}(M_0)).$$

Suppose that $\delta_{4r} < \sqrt{2} - 1$ and that $\lambda$ obeys $\|\mathcal{A}^*(z)\| \le \lambda/2$. Then the solution $\hat M$ to (1.3) obeys

$$\|\hat M - M\|_F^2 \le C_0\Big(\sum_{i=1}^{n}\min(\lambda^2, \sigma_i^2(M)) + \|\mathcal{A}(M - M_0)\|_{\ell_2}^2\Big), \qquad (3.18)$$

where $C_0$ is a small constant depending only on the isometry constant $\delta_{4r}$.

Proof The proof is essentially the same as that of Theorem 2.6 and so we quickly go through the
main steps. Set $M_c = M - M_0$ so that $M_c$ only contains the singular values below the noise level.
First,

$$\|\bar M - M\|_F^2 \le 2\|\bar M - M_0\|_F^2 + 2\|M_c\|_F^2$$
$$\le \frac{2}{1 - \delta_{2r}}\|\mathcal{A}(\bar M - M_0)\|_{\ell_2}^2 + 2\|M_c\|_F^2$$
$$\le \frac{4}{1 - \delta_{2r}}\|\mathcal{A}(\bar M - M)\|_{\ell_2}^2 + \frac{4}{1 - \delta_{2r}}\|\mathcal{A}(M_c)\|_{\ell_2}^2 + 2\|M_c\|_F^2.$$

Second, we bound $\|\hat M - \bar M\|_F$ using the exact same steps as in the proof of Theorem 2.6 and
obtain

$$\|\hat M - \bar M\|_F^2 \le Cr\lambda^2.$$

Hence,

$$\|\hat M - M\|_F^2 \le C\big(K(\bar M; M) + \|\mathcal{A}(M_c)\|_{\ell_2}^2 + \|M_c\|_F^2\big).$$

Finally, use $K(\bar M; M) \le K(M_0; M)$ as before, and simplify to attain (3.18). ■

The factor $\|\mathcal{A}(M_c)\|_{\ell_2}^2$ in (3.18) prevents us from stating the bound as the near-ideal
bias-variance trade-off (2.15). However, many random measurement ensembles obeying the RIP are
also unlikely to drastically change the norm of any fixed matrix (see (2.2)). Thus, we expect that
$\|\mathcal{A}(M_c)\|_{\ell_2} \approx \|M_c\|_F$ with high probability. Specifically, if $\mathcal{A}$ obeys (2.3), then

$$\|\mathcal{A}(M_c)\|_{\ell_2}^2 \le 1.5\,\|M_c\|_F^2, \qquad (3.19)$$

with probability at least $1 - De^{-cm}$ for fixed constants $D$, $c$. An important point here is that
this inequality only holds (with high probability) when $M_c$ is fixed, and $\mathcal{A}$ is chosen randomly
(independently). In the worst-case scenario, one could have

$$\|\mathcal{A}(M_c)\|_{\ell_2} = \|\mathcal{A}\|\,\|M_c\|_F,$$

where $\|\mathcal{A}\|$ is the operator norm of $\mathcal{A}$. Thus we emphasize that the bound holds with high
probability for a given $M$ verifying our conditions, but may not hold uniformly over all such $M$'s.
Returning to the proof, (3.24) together with

$$\|M_c\|_F^2 = \sum_{i=1}^{n}\sigma_i^2(M)\,1_{\{\sigma_i(M) \le \lambda\}}$$

give the following lemma:
Lemma 3.7 Fix $M$ and suppose $\mathcal{A}$ obeys (2.2). Then under the assumptions of Lemma 3.6, the
solution $\hat M$ to (1.3) obeys

$$\|\hat M - M\|_F^2 \le C_0\sum_{i=1}^{n}\min(\lambda^2, \sigma_i^2(M)) \qquad (3.20)$$

with probability at least $1 - De^{-cn}$, where $C_0$ is a small constant depending only on the isometry
constant $\delta_{4r}$, and $c$, $D$ are fixed constants.

The above two lemmas require a bound on the rank of $M_0$. However, as the noise level
approaches zero, the rank of $M_0$ approaches the rank of $M$, which can be as large as the dimension.
This requires further analysis, and in order to provide theoretical error bounds when the noise level
is low (and $M$ has full rank, say), a certain property of many measurement operators is useful. We
call it the NNQ property; it is inspired by a similar property from compressive sensing, see [25].

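Definition 3.8 (NNQ) Let $B_*^{n \times n}$ be the set of n x n matrices with nuclear norm bounded by 1. Let
$B_{\ell_2}^m$ be the standard $\ell_2$ unit ball for vectors in $\mathbb{R}^m$. We say that $\mathcal{A}$ satisfies NNQ($\alpha$) if

$$\mathcal{A}(B_*^{n \times n}) \supseteq \alpha B_{\ell_2}^m. \qquad (3.21)$$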

This condition may appear cryptic at the moment. To give a taste of why it may be useful, note
that Lemma 3.2 includes $\|M - M_r\|_*$ as part of the error bound. The point is that using the NNQ
condition, we can find a proxy for $M - M_r$, which we call $\tilde M$, satisfying $\mathcal{A}(\tilde M) = \mathcal{A}(M - M_r)$, but
also $\|\tilde M\|_* \le \|\mathcal{A}(M - M_r)\|_{\ell_2}/\alpha$. Before continuing this line of thought, we prove that Gaussian
measurement ensembles satisfy NNQ($\mu\sqrt{n/m}$) with high probability for some fixed constant $\mu > 0$.

Theorem 3.9 (NNQ for Gaussian measurements) Suppose $\mathcal{A}$ is a Gaussian measurement
ensemble and $m \le Cn^2/\log(m/n)$ for some fixed constant $C > 0$. Then $\mathcal{A}$ satisfies NNQ($\mu\sqrt{n/m}$)
with probability at least $1 - 3e^{-cn}$ for fixed constants $c$ and $\mu$.

Proof Put $\alpha = \mu\sqrt{n/m}$ and suppose $\mathcal{A}$ does not satisfy NNQ($\alpha$). Then there exists a vector
$x \in \mathbb{R}^m$ with $\|x\|_{\ell_2} = 1$ such that

$$\langle \mathcal{A}(M), x\rangle \le \alpha \quad \text{for all } M \in B_*^{n \times n}.$$

In particular,

$$\|\mathcal{A}^*(x)\| \le \alpha.$$

Let $\bar B_{\ell_2}^m \subset B_{\ell_2}^m$ be an $\alpha$-net for $B_{\ell_2}^m$ with $|\bar B_{\ell_2}^m| \le (3/\alpha)^m$. Then there exists $\bar x \in \bar B_{\ell_2}^m$ with $\|\bar x - x\|_{\ell_2} \le
\alpha$ satisfying

$$\|\mathcal{A}^*(\bar x)\| \le \|\mathcal{A}^*(\bar x - x)\| + \|\mathcal{A}^*(x)\| \le \langle uv^*, \mathcal{A}^*(\bar x - x)\rangle + \alpha,$$

where $u$, $v$ are the left and right singular vectors of $\mathcal{A}^*(\bar x - x)$ corresponding to the top singular
value. Then

$$\langle uv^*, \mathcal{A}^*(\bar x - x)\rangle = \langle \mathcal{A}(uv^*), \bar x - x\rangle \le \|\mathcal{A}(uv^*)\|_{\ell_2}\|\bar x - x\|_{\ell_2} \le \sqrt{1 + \delta_1}\,\alpha$$

and, therefore,

$$\|\mathcal{A}^*(\bar x)\| \le 3\alpha$$

assuming $\delta_1 \le 1$ (this occurs with probability at least $1 - 2e^{-cn}$ when $m \ge Cn$ for fixed constants
$c$, $C$).

We will provide the contradiction by showing that with high probability, $\|\mathcal{A}^*(\bar x)\| > 3\alpha$ for
all $\bar x \in \bar B_{\ell_2}^m$. For each $\bar x$, $\mathcal{A}^*(\bar x)$ is equal in distribution to $\frac{1}{\sqrt{m}}Z$, where $Z$ is a matrix with
i.i.d. standard normal entries. Let $Z_i$ be the ith column of $Z$. Then

$$P(\|\mathcal{A}^*(\bar x)\| \le 3\alpha) \le P(\|Z\| \le 3\sqrt{m}\,\alpha) \le P\Big(\max_{i=1,\ldots,n}\|Z_i\|_{\ell_2} \le 3\sqrt{m}\,\alpha\Big);$$

the second step uses the fact that the operator norm of $Z$ is always larger than or equal to the $\ell_2$ norm
of any column. With $\alpha = \mu\sqrt{n/m}$ and using the fact that the columns are independent, this yields

$$P(\|\mathcal{A}^*(\bar x)\| \le 3\alpha) \le P(\|Z_1\|_{\ell_2}^2 \le 9\mu^2 n)^n.$$

However, $\|Z_1\|_{\ell_2}^2$ is a chi-squared random variable with n degrees of freedom, and can be bounded
using a standard concentration of measure result [19]:

$$P(\|Z_1\|_{\ell_2}^2 - n \le -t\sqrt{2n}) \le e^{-t^2/2}.$$

Hence,

$$P(\|\mathcal{A}^*(\bar x)\| \le 3\alpha) \le e^{-cn^2},$$

where $c = (1 - 9\mu^2)^2/4$ (we require $\mu < 1/3$ here). Thus, by the union bound,

$$P\Big(\min_{\bar x \in \bar B_{\ell_2}^m}\|\mathcal{A}^*(\bar x)\| \le 3\alpha\Big) \le (3/\alpha)^m e^{-cn^2} = \exp\big(m\log(3/\alpha) - cn^2\big) \le e^{-c'n^2}$$

provided that $m \le Cn^2/\log(m/n)$ for fixed constants $C$, $c'$. The theorem is established. ■
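
The chi-squared lower-tail step in the proof can be illustrated by simulation. The sketch below estimates $P(\|Z_1\|_{\ell_2}^2 \le 9\mu^2 n)$ by Monte Carlo and compares it with the bound $e^{-cn}$, $c = (1 - 9\mu^2)^2/4$, obtained from the concentration inequality above. The function names and the parameter values in the example are illustrative assumptions.

```python
import math
import random


def lower_tail_frequency(n, mu, trials, seed=0):
    """Monte Carlo estimate of P(chi^2_n <= 9 * mu^2 * n)."""
    rng = random.Random(seed)
    threshold = 9.0 * mu * mu * n
    hits = 0
    for _ in range(trials):
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
        if chi2 <= threshold:
            hits += 1
    return hits / trials


def tail_bound(n, mu):
    """Bound exp(-c n) with c = (1 - 9 mu^2)^2 / 4, valid for mu < 1/3."""
    c = (1.0 - 9.0 * mu * mu) ** 2 / 4.0
    return math.exp(-c * n)


if __name__ == "__main__":
    n, mu = 10, 0.2
    print("empirical frequency:", lower_tail_frequency(n, mu, 5000))
    print("concentration bound:", tail_bound(n, mu))
```

The proof then raises this single-column probability to the power n, using independence of the columns, which is what produces the $e^{-cn^2}$ rate needed to survive the union bound over the net.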

Note that the preceding proof can be repeated when A is a sub-Gaussian measurement ensemble; 
the only difference is that Z above will contain sub-Gaussian entries, rather than Gaussian entries. 

Using the NNQ property, we can now bound the error when the noise level is low; this does not
involve any condition on the rank of $M_0$, and does not involve a term in the bound depending on
$\|M - M_0\|$.
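Lemma 3.10 Suppose that $\mathcal{A}$ satisfies NNQ($\mu\sqrt{n/m}$) for a fixed constant $\mu$ and that $\|\mathcal{A}^*(z)\| \le \lambda$.
Let $\bar r \ge cm/n$ for some fixed numerical constant $c$, and suppose that $\delta_{4\bar r} < \tfrac{1}{2}(\sqrt{2} - 1)$. Let

$$M_{\bar r} = \sum_{i=1}^{\bar r}\sigma_i(M)\,u_i v_i^*.$$

Let $\hat M$ be the solution to (1.3). Then

$$\|\hat M - M\|_F \le C\big(\lambda\sqrt{\bar r} + \|\mathcal{A}(M - M_{\bar r})\|_{\ell_2}\big) + \|M - M_{\bar r}\|_F. \qquad (3.22)$$

Proof Set $M_c = M - M_{\bar r} = \sum_{i=\bar r+1}^{n}\sigma_i(M)\,u_i v_i^*$. The NNQ($\alpha$) property with $\alpha = \mu\sqrt{n/m}$ gives

$$\mathcal{A}(M_c) = \mathcal{A}(\tilde M)$$

for some $\tilde M$ obeying $\|\tilde M\|_* \le \|\mathcal{A}(M_c)\|_{\ell_2}/\alpha$. We also take note of the identity $\mathcal{A}(M_{\bar r} + \tilde M) = \mathcal{A}(M)$.
It follows from Lemma 3.2 that

$$\|\hat M - (M_{\bar r} + \tilde M)\|_F \le C\big(\lambda\sqrt{\bar r} + \|\tilde M\|_*/\sqrt{\bar r}\big).$$

Plugging in $\|\tilde M\|_* \le \|\mathcal{A}(M_c)\|_{\ell_2}/\alpha$, along with $\bar r \ge cm/n$, we obtain

$$\|\hat M - (M_{\bar r} + \tilde M)\|_F \le C\big(\lambda\sqrt{\bar r} + \|\mathcal{A}(M_c)\|_{\ell_2}\big).$$

Therefore,

$$\|\hat M - M\|_F \le C\big(\lambda\sqrt{\bar r} + \|\mathcal{A}(M_c)\|_{\ell_2}\big) + \|\tilde M\|_F + \|M_c\|_F. \qquad (3.23)$$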

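It remains to bound $\|\tilde M\|_F$. As in the proof of Lemma 3.2, decompose $\tilde M$ as $\tilde M = \tilde M_1 + \tilde M_2 + \ldots$
so that $\tilde M_1$ corresponds to the largest $\bar r$ singular values of $\tilde M$, $\tilde M_2$ corresponds to the next $\bar r$ largest,
and so on. Just as before,

$$\|\tilde M\|_F \le \sum_{i \ge 1}\|\tilde M_i\|_F \le \|\tilde M_1\|_F + \|\tilde M\|_*/\sqrt{\bar r}.$$

We now bound $\|\tilde M_1\|_F$. By the RIP,

$$\|\tilde M_1\|_F \le \frac{1}{\sqrt{1 - \delta_{\bar r}}}\|\mathcal{A}(\tilde M_1)\|_{\ell_2} \le \frac{1}{\sqrt{1 - \delta_{\bar r}}}\Big(\|\mathcal{A}(\tilde M)\|_{\ell_2} + \sum_{i \ge 2}\|\mathcal{A}(\tilde M_i)\|_{\ell_2}\Big).$$

By the RIP again, $\|\mathcal{A}(\tilde M_i)\|_{\ell_2} \le \sqrt{1 + \delta_{\bar r}}\,\|\tilde M_i\|_F$, and so

$$\sum_{i \ge 2}\|\mathcal{A}(\tilde M_i)\|_{\ell_2} \le \sqrt{1 + \delta_{\bar r}}\sum_{i \ge 2}\|\tilde M_i\|_F \le \sqrt{1 + \delta_{\bar r}}\,\frac{\|\tilde M\|_*}{\sqrt{\bar r}}.$$

This together with $\mathcal{A}(\tilde M) = \mathcal{A}(M_c)$ give

$$\|\tilde M_1\|_F \le \frac{1}{\sqrt{1 - \delta_{\bar r}}}\Big(\|\mathcal{A}(M_c)\|_{\ell_2} + \sqrt{1 + \delta_{\bar r}}\,\frac{\|\tilde M\|_*}{\sqrt{\bar r}}\Big).$$

However, $\|\tilde M\|_* \le \|\mathcal{A}(\tilde M)\|_{\ell_2}/\alpha \le \sqrt{\bar r}\,\|\mathcal{A}(\tilde M)\|_{\ell_2}/(\mu\sqrt{c})$ and, therefore,

$$\|\tilde M\|_F \le C\|\mathcal{A}(M_c)\|_{\ell_2}.$$

Inserting this into (3.23) completes the proof of the lemma. ■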

We are now in position to prove our main theorem concerning the recovery of matrices with
decaying singular values (Theorem 2.7). There are three cases to consider depending on the number
of singular values of $M$ standing above the noise level. In each case, we need the inequality

$$\|\mathcal{A}(M_c)\|_{\ell_2}^2 \le 1.5\,\|M_c\|_F^2, \qquad (3.24)$$

which holds with probability at least $1 - De^{-cn}$ for any measurement ensemble satisfying (2.2)
(including the Gaussian measurement ensemble). Put $\lambda = 16\sqrt{n}\,\sigma$ and recall the definition of $M_0$:

$$M_0 = \sum_{i=1}^{n}\sigma_i(M)\,1_{\{\sigma_i(M) > \lambda\}}\,u_i v_i^*,$$

whose rank is exactly the number of singular values of $M$ above the noise level. There are three
cases to consider depending mostly on the interplay between the singular values of $M$ and the noise
level.

Case 1: high noise level 

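Suppose $K(M_0; M) \le \frac{\lambda^2}{4(1 + \delta_1)}\,\bar r$. Then $\mathrm{rank}(M_0) \le \bar r$ and $\mathrm{rank}(\bar M) \le \bar r$ by definition of $\bar M$. Hence,
Lemma 3.7 gives

$$\|\hat M - M\|_F^2 \le C\sum_{i=1}^{n}\min(n\sigma^2, \sigma_i^2(M))$$

with probability at least $1 - 2e^{-cn}$.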
Case 2: low noise level 

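Suppose $K(M_0; M) > \frac{\lambda^2}{4(1 + \delta_1)}\,\bar r$ and $\mathrm{rank}(M_0) > \bar r$. It follows from (2.3) that

$$\|\mathcal{A}(M - M_{\bar r})\|_{\ell_2} \le \sqrt{1.5}\,\|M - M_{\bar r}\|_F \qquad (3.25)$$

with probability at least $1 - De^{-cn}$. Now, for the Gaussian measurement ensemble, the requirements
of Lemma 3.10 are met with probability at least $1 - Ce^{-cn}$. Combining (3.25) with Lemma 3.10
yields

$$\|\hat M - M\|_F \le C\big(\lambda\sqrt{\bar r} + \|M - M_{\bar r}\|_F\big)$$

and thus

$$\|\hat M - M\|_F^2 \le 2C^2\Big(\sum_{i=1}^{\bar r}\min(\lambda^2, \sigma_i^2(M)) + \sum_{i=\bar r+1}^{n}\sigma_i^2(M)\Big).$$

Since $\lambda = 16\sqrt{n}\,\sigma$, this is (2.16).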
Case 3: medium noise level 

Suppose $K(M_0; M) > \frac{\lambda^2}{4(1 + \delta_1)}\,\bar r$ and $\mathrm{rank}(M_0) \le \bar r$. As in Case 2, we have

$$\|\hat M - M\|_F^2 \le 2C^2\big(\lambda^2\bar r + \|M - M_{\bar r}\|_F^2\big).$$

From $\lambda^2\bar r \le 4(1 + \delta_1)K(M_0; M)$, it follows that

$$\|\hat M - M\|_F^2 \le 2C^2\big(\lambda^2\,\mathrm{rank}(M_0) + 4(1 + \delta_1)\|\mathcal{A}(M - M_0)\|_{\ell_2}^2 + \|M - M_{\bar r}\|_F^2\big).$$

We also have $\|\mathcal{A}(M - M_0)\|_{\ell_2}^2 \le 1.5\,\|M - M_0\|_F^2$ with probability at least $1 - De^{-cn}$. Inserting
this bound into the previous equation, along with $\|M - M_{\bar r}\|_F \le \|M - M_0\|_F$, gives the desired
conclusion.

These three cases comprise all possibilities. In short, the proof of Theorem 2.7 is complete.
3.7 Extension of proofs to the solution to the Lasso (1.5)



In the sparse regression setup, Bickel et al. [3] showed that the Dantzig Selector and the Lasso have 
analogous properties, leading to analogous error bounds. The analogies still hold in the low-rank 
matrix recovery problem (for similar reasons). In fact, all of the theorems above also hold for the 
solution to (1.5) aside from a shift in those constants appearing in the assumptions, and those
appearing in the error bounds. To see this, note that our proofs only used two crucial properties
about $\hat M$:

1. $\|\hat M\|_* \le \|M\|_*$,

2. $\|\mathcal{A}^*(\mathcal{A}(\hat M) - y)\| \le \lambda$.

The second property automatically holds for the solution to (1.5) (but with $\lambda$ replaced by $\mu$). This
follows from the optimality conditions, which state that $\mathcal{A}^*(y - \mathcal{A}(\hat M)) \in \mu\,\partial\|\hat M\|_*$, where $\partial\|\hat M\|_*$ is
the family of subgradients to the nuclear norm at the minimizer. Formally, let $U\Sigma V^*$ be the SVD
of $\hat M$, then

$$\mathcal{A}^*(y - \mathcal{A}(\hat M)) = \mu(UV^* + W)$$

for some $W$ obeying $\|W\| \le 1$ and $U^*W = 0$, $WV = 0$ (see e.g. [10]). Hence, the second property
follows from $\|UV^* + W\| \le 1$.

The first property does not necessarily hold for the matrix Lasso, but a close enough
approximation is verified (this is analogous to an argument made in [3]). Suppose that $\|\mathcal{A}^*(z)\| \le c_0\mu$
for a small constant $c_0$ (which, by Lemma 1.1, holds with high probability for Gaussian noise if
$\mu = C\sqrt{n}\,\sigma$). Then since $\hat M$ minimizes (1.5), we have

$$\tfrac{1}{2}\|\mathcal{A}(\hat M) - y\|_{\ell_2}^2 + \mu\|\hat M\|_* \le \tfrac{1}{2}\|\mathcal{A}(M) - y\|_{\ell_2}^2 + \mu\|M\|_*.$$

Plug in $y = \mathcal{A}(M) + z$ and rearrange terms to give

$$\mu\|\hat M\|_* \le \tfrac{1}{2}\|\mathcal{A}(\hat M - M)\|_{\ell_2}^2 + \mu\|\hat M\|_* \le \langle \hat M - M, \mathcal{A}^*(z)\rangle + \mu\|M\|_*.$$

Since the nuclear norm and the operator norm are dual to each other, we have $\langle \hat M - M, \mathcal{A}^*(z)\rangle \le
\|\hat M - M\|_* \cdot \|\mathcal{A}^*(z)\| \le c_0\mu\|H\|_*$, where we use the notation $H = \hat M - M$ as in the proof of Lemma
3.2. This gives

$$\|\hat M\|_* \le c_0\|H\|_* + \|M\|_*,$$

which nearly is the first property. When $c_0$ is chosen to be a small constant, this factor has no
essential detrimental effects on the proof. In particular, (3.7) in the proof of Lemma 3.2 is replaced
by

$$(1 - c_0)\|H_c\|_* \le (1 + c_0)\|H_0\|_* + 2\|M_c\|_*.$$

In particular, for $c_0 = 1/2$,

$$\|H_c\|_* \le 3\|H_0\|_* + 4\|M_c\|_*.$$

The rest of the proofs follow.
3.8 Proof of Theorem 2.5



We begin with a well-known lemma which gives the minimax risk for estimating the vector $x \in \mathbb{R}^n$
from the data $y \in \mathbb{R}^m$ and the linear model

$$y = Ax + z, \qquad (3.26)$$

where $A \in \mathbb{R}^{m \times n}$ and the $z_i$'s are i.i.d. $\mathcal{N}(0, \sigma^2)$.

Lemma 3.11 Let $\lambda_i(A^*A)$ be the eigenvalues of the matrix $A^*A$. Then

$$\inf_{\hat x}\sup_{x \in \mathbb{R}^n} E\|\hat x - x\|_{\ell_2}^2 = \sigma^2\,\mathrm{trace}((A^*A)^{-1}) = \sum_{i=1}^{n}\frac{\sigma^2}{\lambda_i(A^*A)}. \qquad (3.27)$$

In particular, if one of the eigenvalues vanishes (as in the case in which $m < n$), then the minimax
risk is unbounded.

Proof Suppose first that $A$ is the identity matrix. Then it is well known that the minimax risk is
$n\sigma^2$ and is achieved by $\hat x = y$. To see this, recall that for any prior on $x$, the minimax risk is lower
bounded by the Bayes risk. Consider then the prior which assumes that all the components of $x$
are i.i.d. $\mathcal{N}(0, \tau^2)$. Then the Bayes estimator for this prior is the shrinkage estimate given by

$$\hat x_i = E(x_i \mid y_i) = \frac{\tau^2}{\tau^2 + \sigma^2}\,y_i,$$

and the Bayes risk is

$$\sum_{i=1}^{n} E(x_i - \hat x_i)^2 = n\sigma^2\,\frac{\tau^2}{\sigma^2 + \tau^2}.$$

Clearly, as $\tau \to \infty$, the lower bound on the minimax risk goes to $n\sigma^2$. Since this quantity is the
risk of the maximum-likelihood estimate $y$, this proves the claim. Note that by a simple rescaling
argument, this also proves that the minimax risk for estimating $x$ from $y_i = d_i x_i + z_i$ is $\sum_{i=1}^{n}\sigma^2/d_i^2$.
We can now prove (3.27). We will assume that $m \ge n$ for simplicity since for $m < n$, the
minimax risk is unbounded. Let $U\Sigma V^*$ be the SVD of $A$, where $U$ is m x n, $\Sigma$ is n x n and $V$ is
n x n. All the information about $x$ is in $U^*y$, and so we may just assume that the data is given by

$$y' = U^*y = \Sigma V^*x + U^*z.$$

Now $z' = U^*z$ is a Gaussian vector with i.i.d. $\mathcal{N}(0, \sigma^2)$ components. Further, set $x' = V^*x$. Since
$V$ is an orthogonal matrix, the minimax risk for estimating $x$ or $x'$ is the same and, therefore, our
problem is that of computing the minimax risk for estimating $x'$ from

$$y' = \Sigma x' + z'.$$

Since $\Sigma$ is a diagonal matrix with diagonal elements $\sqrt{\lambda_i(A^*A)}$, our previous result applies and
establishes (3.27). ■
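
The two formulas in this proof are easy to evaluate numerically. The sketch below computes the Bayes risk of the shrinkage estimator under the $\mathcal{N}(0, \tau^2)$ prior, which increases to $n\sigma^2$ as $\tau$ grows, and the minimax risk $\sum_i \sigma^2/\lambda_i(A^*A)$ from a list of eigenvalues. The function names are hypothetical and chosen for illustration only.

```python
import math


def bayes_risk(n, sigma, tau):
    """Bayes risk n * sigma^2 * tau^2 / (sigma^2 + tau^2) for the identity model."""
    return n * sigma ** 2 * tau ** 2 / (sigma ** 2 + tau ** 2)


def minimax_risk(eigenvalues, sigma):
    """Minimax risk sum_i sigma^2 / lambda_i; infinite if an eigenvalue vanishes."""
    if any(lam <= 0 for lam in eigenvalues):
        return math.inf
    return sum(sigma ** 2 / lam for lam in eigenvalues)


if __name__ == "__main__":
    n, sigma = 4, 1.0
    for tau in (1.0, 10.0, 1000.0):
        print("tau =", tau, "Bayes risk =", bayes_risk(n, sigma, tau))
    print("minimax risk, identity design:", minimax_risk([1.0] * n, sigma))
```

When all eigenvalues lie in $[1 - \delta_r, 1 + \delta_r]$, each term of the sum is within a constant factor of $\sigma^2$, which is how the lemma is used in the lower bound for rank-r matrices.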

We are now in position to prove Theorem 12.51 The set of rank-r matrices is (much) larger than 
the set of matrices of the form 

M = UK 
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where U is a fixed orthogonal n x r matrix with orthonormal columns (note that the matrices of 
this form have a fixed r-dimensional column space). Thus, 



inf sup E||M-M||| > inf sup E||M-M||£. 

M M:rank(M)=r M M:M=UR 

Knowing that M = UR for some unknown r x n matrix R, we can of course restrict ourselves to 
estimators of the form $\hat{M} = U\hat{R}$, and since 

$$ E \|\hat{M} - M\|_F^2 = E \|U\hat{R} - UR\|_F^2 = E \|\hat{R} - R\|_F^2, $$

the minimax risk is lower bounded by that of estimating R from the data 

$$ y = A_U(R) + z, $$

where $A_U$ is the linear map (2.11). We then apply Lemma 3.11 to conclude that the minimax rate 
is lower bounded by 

$$ \sum_i \frac{\sigma^2}{\lambda_i(A_U^* A_U)}. $$

The claim follows from the simple lemma below. 
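
To spell out this last step (a sketch; the exact form of the bound in the theorem being proved is taken here as an assumption, since its statement lies outside this section): the operator acts on the space of r x n matrices, so it has nr eigenvalues, and if each of them is at most $1 + \delta_r$, as Lemma 3.12 below asserts, then

```latex
\sum_{i=1}^{nr} \frac{\sigma^2}{\lambda_i(A_U^* A_U)}
  \;\geq\; \sum_{i=1}^{nr} \frac{\sigma^2}{1 + \delta_r}
  \;=\; \frac{nr\,\sigma^2}{1 + \delta_r}.
```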

Lemma 3.12 Let U be an n x r matrix with orthonormal columns. Then all the eigenvalues of 
$A_U^* A_U$ belong to the interval $[1 - \delta_r, 1 + \delta_r]$. 

Proof By definition, 

$$ \lambda_{\min}(A_U^* A_U) = \inf_{\|R\|_F = 1} \langle R, A_U^* A_U(R) \rangle $$

and similarly for $\lambda_{\max}(A_U^* A_U)$ with a sup in place of inf. Since 

$$ \langle R, A_U^* A_U(R) \rangle = \|A_U(R)\|_{\ell_2}^2 = \|A(UR)\|_{\ell_2}^2, $$

the claim follows from 

$$ (1 - \delta_r) \|UR\|_F^2 \leq \|A(UR)\|_{\ell_2}^2 \leq (1 + \delta_r) \|UR\|_F^2, $$

which is valid since $\mathrm{rank}(UR) \leq r$, together with $\|UR\|_F = \|R\|_F$. ■ 
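
The lemma can be illustrated numerically. The sketch below is not from the paper: it assumes a Gaussian measurement ensemble with i.i.d. N(0, 1/m) entries acting on the row-major vectorization of an n x n matrix, and the names, sizes, and use of NumPy are illustrative. It forms the matrix of the map R -> A(UR) and prints the extreme eigenvalues of the associated Gram matrix. The restricted isometry constant is not computed here, so the code only reports how far the eigenvalues stray from 1, which by the lemma gives a lower bound on that constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m, sigma = 8, 2, 400, 1.0   # illustrative sizes (assumptions)

# Gaussian measurement matrix acting on vec(M), M an n x n matrix (row-major).
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n * n))

# Random n x r matrix U with orthonormal columns, from a reduced QR factorization.
U, _ = np.linalg.qr(rng.normal(size=(n, r)))

# Matrix of the map R -> A(UR). For row-major vec, vec(UR) = kron(U, I_n) vec(R).
A_U = A @ np.kron(U, np.eye(n))   # shape (m, r * n)

# Eigenvalues of A_U^T A_U, returned in ascending order.
lam = np.linalg.eigvalsh(A_U.T @ A_U)
print("smallest eigenvalue:", lam[0])
print("largest eigenvalue :", lam[-1])

# Lower bound on the risk of estimating R, and the cruder bound n r sigma^2 / lam_max.
risk_bound = float(np.sum(sigma**2 / lam))
print("sum_i sigma^2 / lambda_i:", risk_bound)
print("n r sigma^2 / lambda_max:", n * r * sigma**2 / lam[-1])
```

With many more measurements than the nr degrees of freedom of R, the eigenvalues are expected to cluster around 1, and the first printed bound is never smaller than the second.
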

4 Discussion 

Using RIP-based analysis, this paper has shown that low-rank matrices can be stably recovered via 
nuclear-norm minimization from nearly the minimal possible number of linear samples. Further, 
the error bound is within a constant of the expected minimax error, and of an expected oracle 
error, and extends to the case when M has full rank. 

This work differs from the main thrust of the recent literature on low-rank matrix recovery, which 
has concentrated on the 'RIPless' matrix completion problem. An interesting observation regarding 
matrix completion is that when the measurements are randomly chosen entries of M, one requires 
at least about nr log n measurements to recover M by any method when rank(M) = O(1) [HlllOj. 
In contrast, this paper shows that on the order of nr measurements are enough provided these are 
sufficiently random. 

The popularity of the matrix completion model stems from the fact that this setup currently 
dominates the applications of low-rank matrix recovery. There are far fewer applications in which 
the measurements are random linear combinations of many entries of M (quantum-state tomography 
is a notable application though). As a great deal of attention is given to low-rank matrix 
modeling these days, with new applications being discovered all the time, this may change rapidly. 
We hope that our theory encourages further applications and research in this direction. 